Join Paper Club with Princeton University on Model Alignment Challenges in Preference Learning

Tuesday, November 26th, 2024 • 12PM to 1PM (EST)

Virtual Meeting

Event Ended

This event has already taken place.

Attendees 266+ registered

Attendees feature CTOs (kaiko.ai, Vital.ai), Distinguished Engineers (Morgan Stanley), and ML experts building production RLHF, LLM safety, and Graph RAG systems.

Join Our Paper Club Event Series! Meet with Sadhika Malladi, AI Researcher at Princeton University, and discuss the challenges of aligning language models with human preferences.

Don’t miss this unique opportunity: Hear directly from the researcher & join a live Q&A!

Info	Details
Event	Paper Club with Sadhika Malladi on “Preference Learning Algorithms Do Not Learn Preference Rankings” and “Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization”
Date & Time	November 26, 2024, 12:00 PM EST
Presenter	Sadhika Malladi, AI Researcher, Princeton University
Research Papers	📄 Preference Learning Algorithms Do Not Learn Preference Rankings and 📄 Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Audio Version	By Paper2Audio

Meet the Researcher

Meet Sadhika Malladi, an AI researcher focused on preference learning and alignment in AI systems. Her recent work explores the limitations of current preference learning algorithms and highlights the risks associated with likelihood displacement during training.

Key Insights from the Papers:

For AI/ML engineers, aligning large language models (LLMs) with human preferences using methods like RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) presents both opportunities and challenges. We will walk through recent Princeton University research on the risks and on practices for mitigating them.

Key Risks to Watch Out For:

Ranking Accuracy Gap: Even state-of-the-art preference-tuned models often fail to rank preferred outputs above dispreferred ones, achieving less than 60% ranking accuracy on common preference datasets. This highlights a disconnect between current training objectives and the desired model behavior.
Likelihood Displacement: Training can unintentionally decrease the likelihood of preferred responses and shift probability mass to harmful or incorrect outputs. For example, a model trained to refuse unsafe prompts saw its refusal rate drop from 74.4% to 33.4%, introducing unintended risks.
Overlapping Preferences in Data: Preferences that are too similar in the training dataset can exacerbate alignment issues, leading to models misinterpreting subtle distinctions between desirable and undesirable outcomes.
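
To make the first two risks concrete, here is a minimal sketch of how they could be measured. It assumes you already have the model's sequence log-likelihood for the preferred and the dispreferred response of each pair, before and after training. The function names and the numbers are made up for illustration and are not taken from the papers.

```python
from typing import List, Sequence


def ranking_accuracy(chosen_logps: Sequence[float],
                     rejected_logps: Sequence[float]) -> float:
    """Share of pairs where the preferred response has the higher log-likelihood."""
    if len(chosen_logps) != len(rejected_logps) or not chosen_logps:
        raise ValueError("need two non-empty sequences of equal length")
    wins = sum(c > r for c, r in zip(chosen_logps, rejected_logps))
    return wins / len(chosen_logps)


def displaced_pairs(chosen_before: Sequence[float],
                    chosen_after: Sequence[float]) -> List[int]:
    """Indices of pairs whose preferred response became less likely after training."""
    return [i for i, (before, after) in enumerate(zip(chosen_before, chosen_after))
            if after < before]


# Made-up sequence log-likelihoods for three preference pairs.
chosen_before, rejected_before = [-12.0, -8.5, -20.0], [-11.0, -9.0, -19.0]
chosen_after, rejected_after = [-14.0, -8.0, -25.0], [-18.0, -9.5, -31.0]

print(ranking_accuracy(chosen_before, rejected_before))  # 1 of 3 pairs ranked correctly
print(ranking_accuracy(chosen_after, rejected_after))    # 3 of 3 pairs ranked correctly
print(displaced_pairs(chosen_before, chosen_after))      # [0, 2]
```

In this toy data the ranking accuracy improves after training, yet two of the three preferred responses become less likely in absolute terms. That is the pattern the likelihood displacement risk describes: the margin between the two responses can grow while the preferred response itself loses probability mass.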

Discuss Best Practices for Mitigation:

Focus on Model Objectives: Evaluate and refine training objectives to better capture the nuances of human preferences. Ensure that metrics like ranking accuracy and win rate are closely monitored during training.
Analyze and Curate Data: Use tools like the CHES (Centered Hidden Embedding Similarity) score to identify problematic training samples with overlapping preferences. Filtering or re-weighting these samples can reduce unintended misalignments.
Monitor Model Behavior: During and after training, track not just success metrics but also unintended shifts in behavior, particularly when dealing with safety-critical tasks. This allows for early detection of issues like likelihood displacement.
Iterative Feedback Loops: Incorporate iterative rounds of human feedback and testing to refine alignment progressively, addressing gaps between expected and actual outcomes.
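
The data curation step above could be sketched roughly as follows. This assumes the CHES score of a pair is the inner product between the summed hidden embeddings of the preferred and dispreferred responses, minus the squared norm of the summed preferred embeddings, and that higher scores mark the pairs most worth filtering. Please check the paper for the exact definition. The helper names, the toy two-dimensional embeddings, and the `keep_fraction` parameter are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _sum_vectors(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise sum of per-token hidden embeddings."""
    return [sum(column) for column in zip(*vectors)]


def ches_score(chosen_hidden: Sequence[Vector],
               rejected_hidden: Sequence[Vector]) -> float:
    """Assumed CHES form: <sum(h+), sum(h-)> - ||sum(h+)||^2."""
    s_plus = _sum_vectors(chosen_hidden)
    s_minus = _sum_vectors(rejected_hidden)
    inner = sum(p * m for p, m in zip(s_plus, s_minus))
    squared_norm = sum(p * p for p in s_plus)
    return inner - squared_norm


def keep_lowest_ches(scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Indices of the samples to keep: the keep_fraction with the lowest scores."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    n_keep = int(len(scores) * keep_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


# Toy per-token hidden embeddings (2 dimensions) for one preference pair.
chosen = [[1.0, 0.0], [0.0, 1.0]]
similar_rejected = [[1.0, 1.0], [1.0, 1.0]]
dissimilar_rejected = [[-1.0, 0.0]]

print(ches_score(chosen, similar_rejected))        # 2.0
print(ches_score(chosen, dissimilar_rejected))     # -3.0
print(keep_lowest_ches([2.0, -3.0, 0.5, 1.0], 0.5))  # [1, 2]
```

In practice the hidden embeddings would come from the model being trained, and re-weighting samples by score is an alternative to dropping them outright.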

How do we ensure AI models behave reliably and safely?

Aligning LLMs with human preferences using RLHF or DPO requires more than following established frameworks: it takes critical evaluation, post-deployment monitoring, and feedback loops to improve model performance in real-world applications.

What is Paper Club?

Paper Club is a virtual event series brought to you by the Human Feedback Foundation in collaboration with AI Tinkerers, featuring authors of cutting-edge AI and machine learning papers. These online meetups allow attendees to hear about groundbreaking research directly from the authors, participate in live Q&A sessions, and engage in discussions. Open to all, Paper Club offers a regular opportunity to learn and interact with leaders in the rapidly evolving field of artificial intelligence.
