Capability
Reference Model Based Preference Normalization
1 artifact provides this capability.
vs others: More stable than RLHF, because normalizing against the reference model curbs reward hacking and distribution shift; simpler than KL-regularized PPO, because the reference model is implicit in the loss rather than requiring explicit KL-penalty tuning.
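As a rough illustration (not taken from the listed artifact), here is a minimal sketch of a DPO-style preference loss in PyTorch, assuming precomputed per-response log-probabilities from the policy and a frozen reference model; the names (preference_loss, beta, the *_logps arguments) are hypothetical.

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # Illustrative sketch, not the listed artifact's implementation.
    # Normalize each response's log-probability by the frozen
    # reference model: log pi(y|x) - log pi_ref(y|x).
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin between chosen and rejected responses;
    # beta plays the role of a KL-penalty weight, but no explicit
    # KL term is computed or tuned.
    logits = beta * (chosen_logratio - rejected_logratio)
    # Maximize the likelihood that the chosen response is preferred.
    return -F.logsigmoid(logits).mean()
```

Because the reference log-probabilities enter only as a subtraction inside the sigmoid, the KL regularization is implicit: a policy that drifts far from the reference pays for it directly in the loss, which is the normalization the comparison above refers to.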