Join Paper Club with Princeton University on Model Alignment Challenges in Preference Learning
This event has already taken place.
Join our Paper Club event series! Meet Sadhika Malladi, AI researcher at Princeton University, and discuss the challenges of aligning language models with human preferences.
Don’t miss this unique opportunity: Hear directly from the researcher & join a live Q&A!

| Info | Details |
|---|---|
| Event | Paper Club with Sadhika Malladi on “Preference Learning Algorithms Do Not Learn Preference Rankings” and “Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization” |
| Date & Time | November 26, 2024, 12:00 PM EST |
| Presenter | Sadhika Malladi, AI Researcher, Princeton University |
| Research Papers | 📄 Preference Learning Algorithms Do Not Learn Preference Rankings and 📄 Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization |
| Audio Version | By Paper2Audio |
Meet the Researcher
Meet Sadhika Malladi, an AI researcher focused on preference learning and alignment in AI systems. Her recent work explores the limitations of current preference learning algorithms and highlights the risks associated with likelihood displacement during training.
Key Insights from the Papers:
For AI/ML engineers, aligning large language models (LLMs) with human preferences using methods like RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) presents both opportunities and challenges. We will walk through recent Princeton University research on the risks and on best practices for mitigating them.
Key Risks to Watch Out For:
- Ranking Accuracy Gap: Most state-of-the-art preference-tuned models achieve less than 60% ranking accuracy on common preference datasets, meaning they often assign a higher likelihood to the dispreferred response than to the preferred one. This highlights a disconnect between current training objectives and the desired model behavior.
- Likelihood Displacement: Training can unintentionally decrease the likelihood of preferred responses and shift probability mass to harmful or incorrect outputs. For example, when a model (Llama-3-8B-Instruct in the paper) was trained to refuse unsafe prompts, its refusal rate dropped from 74.4% to 33.4%.
- Overlapping Preferences in Data: Preferred and dispreferred responses that are too similar to each other in the training dataset can exacerbate alignment issues, leading models to misinterpret subtle distinctions between desirable and undesirable outcomes.
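The first two risks can be checked with simple bookkeeping over per-response log-probabilities. The sketch below is illustrative only: the `PairLogProbs` container, the function names, and the toy numbers are assumptions, not code or data from the papers. It treats ranking accuracy as the fraction of pairs where the model assigns a higher log-probability to the preferred response, and flags likelihood displacement wherever the preferred response's log-probability falls after training.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PairLogProbs:
    """Log-probabilities a model assigns to one preference pair (hypothetical container)."""
    preferred: float      # log-probability of the preferred response given the prompt
    dispreferred: float   # log-probability of the dispreferred response given the prompt


def ranking_accuracy(pairs: Sequence[PairLogProbs]) -> float:
    """Fraction of pairs where the preferred response is ranked above the dispreferred one."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for p in pairs if p.preferred > p.dispreferred)
    return correct / len(pairs)


def displaced_indices(before: Sequence[PairLogProbs],
                      after: Sequence[PairLogProbs]) -> List[int]:
    """Indices of pairs whose preferred log-probability decreased during training."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same pairs")
    return [i for i, (b, a) in enumerate(zip(before, after))
            if a.preferred < b.preferred]


# Toy values, made up for illustration.
before = [PairLogProbs(-12.0, -11.5), PairLogProbs(-8.0, -9.0),
          PairLogProbs(-15.0, -14.0), PairLogProbs(-10.0, -10.5)]
after = [PairLogProbs(-13.0, -16.0), PairLogProbs(-7.5, -12.0),
         PairLogProbs(-16.5, -20.0), PairLogProbs(-9.0, -9.5)]

acc_before = ranking_accuracy(before)
acc_after = ranking_accuracy(after)
displaced = displaced_indices(before, after)
```

In this toy data, ranking accuracy rises from 0.5 to 1.0 while the preferred responses at indices 0 and 2 become less likely. That is why both quantities are worth tracking: a widening margin between preferred and dispreferred responses does not guarantee that preferred responses gained probability.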
Discuss Best Practices for Mitigation:
- Focus on Model Objectives: Evaluate and refine training objectives to better capture the nuances of human preferences. Ensure that metrics like ranking accuracy and win rate are closely monitored during training.
- Analyze and Curate Data: Use tools like the CHES (Centered Hidden Embedding Similarity) score to identify problematic training samples with overlapping preferences. Filtering or re-weighting these samples can reduce unintended misalignments.
- Monitor Model Behavior: During and after training, track not just success metrics but also unintended shifts in behavior, particularly when dealing with safety-critical tasks. This allows for early detection of issues like likelihood displacement.
- Iterative Feedback Loops: Incorporate iterative rounds of human feedback and testing to refine alignment progressively, addressing gaps between expected and actual outcomes.
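The data curation step can be sketched as follows. This assumes the CHES score of a sample is the inner product between the summed hidden embeddings of the preferred and dispreferred responses, minus the squared norm of the preferred sum; please verify that definition against the paper before relying on it. The function names, the `keep_fraction` parameter, and the toy vectors are hypothetical, and extracting hidden embeddings from an actual model is left out.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _sum_vectors(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise sum of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    total = [0.0] * dim
    for v in vectors:
        if len(v) != dim:
            raise ValueError("all vectors must have the same dimension")
        for i, x in enumerate(v):
            total[i] += x
    return total


def ches_score(preferred_hidden: Sequence[Vector],
               dispreferred_hidden: Sequence[Vector]) -> float:
    """CHES under the assumed definition: <s_pref, s_disp> - ||s_pref||^2,
    where each s is the sum of per-token hidden embeddings of a response."""
    s_pref = _sum_vectors(preferred_hidden)
    s_disp = _sum_vectors(dispreferred_hidden)
    if len(s_pref) != len(s_disp):
        raise ValueError("hidden embeddings must share one dimension")
    inner = sum(a * b for a, b in zip(s_pref, s_disp))
    squared_norm = sum(a * a for a in s_pref)
    return inner - squared_norm


def keep_lowest_ches(scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Indices (in original order) of the samples with the lowest scores."""
    n_keep = int(len(scores) * keep_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


# Toy per-token hidden embeddings, made up for illustration.
similar_pair = ches_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
dissimilar_pair = ches_score([[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]])
kept = keep_lowest_ches([similar_pair, dissimilar_pair, 0.5, -1.0], keep_fraction=0.5)
```

Here the pair whose responses point in similar directions scores higher than the dissimilar pair, so a filter that keeps the lowest-scoring half drops it. Re-weighting instead of filtering would use the same scores; the right threshold is a judgment call that depends on the dataset.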
How do we ensure AI models behave reliably and safely?
Aligning LLMs with human preferences using RLHF or DPO requires more than following established frameworks. It takes critical evaluation, post-deployment monitoring, and feedback loops to improve model performance in real-world applications.
What is Paper Club?
Paper Club is a virtual event series brought to you by the Human Feedback Foundation in collaboration with AI Tinkerers, featuring authors of cutting-edge AI and machine learning papers. These online meetups allow attendees to hear about groundbreaking research directly from the authors, participate in live Q&A sessions, and engage in discussions. Open to all, Paper Club offers a regular opportunity to learn and interact with leaders in the rapidly evolving field of artificial intelligence.