Capability
Reward Model Training With Configurable Loss Functions
11 artifacts provide this capability.
Top Matches
Reinforcement learning from human feedback: SFT, DPO, and PPO trainers for LLM alignment.
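The blurbs below reference TRL's ecosystem, so the artifact appears to sit alongside TRL's RewardTrainer. Assuming that, a minimal reward-model fine-tuning run might look like the sketch below; the model and dataset names are placeholders, and the constructor signature varies across TRL releases (older versions take tokenizer= where newer ones take processing_class=).

```python
# A minimal sketch, assuming a recent TRL release; not the artifact's exact API.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.config.pad_token_id = tokenizer.pad_token_id  # needed for batched scoring

# Preference dataset with "chosen"/"rejected" response pairs.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward-model"),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```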
Unique: Supports multiple pairwise loss variants (Bradley-Terry, Elo, margin-based) with hyperparameter suggestions derived automatically from dataset statistics, and includes built-in calibration utilities that convert reward scores into estimated preference probabilities.
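For intuition, the named variants reduce to different pairwise objectives over the score gap between a chosen and a rejected response, and calibration maps that gap to a preference probability. A minimal PyTorch sketch (the function names, the Elo scale of 400, and the temperature parameter are illustrative assumptions, not the artifact's actual API):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen, rejected):
    # NLL of P(chosen > rejected) = sigmoid(s_chosen - s_rejected).
    return -F.logsigmoid(chosen - rejected).mean()

def elo_loss(chosen, rejected, scale=400.0):
    # Elo-style base-10 logistic: P = 1 / (1 + 10 ** (-(s_c - s_r) / scale)).
    # scale=400 follows chess-rating convention; an assumed default here.
    p = 1.0 / (1.0 + 10.0 ** (-(chosen - rejected) / scale))
    return -torch.log(p).mean()

def margin_loss(chosen, rejected, margin=1.0):
    # Hinge-style ranking: require chosen to beat rejected by at least `margin`.
    return F.relu(margin - (chosen - rejected)).mean()

def preference_probability(chosen, rejected, temperature=1.0):
    # Calibration: map a score gap to an estimated preference probability;
    # in practice the temperature would be fitted on held-out preference data.
    return torch.sigmoid((chosen - rejected) / temperature)
```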
vs others: More flexible than monolithic reward models because it supports both regression and ranking objectives; better integrated than standalone reward-modeling libraries because it shares TRL's data pipeline and chat-template handling.
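The regression-vs-ranking distinction amounts to whether the model fits absolute quality ratings or only the relative ordering of paired responses. A hypothetical toggle between the two objective families, sketched in PyTorch (the function name and keyword arguments are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_loss(scores, *, ratings=None, rejected_scores=None, objective="ranking"):
    # Regression: fit scalar rewards directly to absolute quality ratings.
    if objective == "regression":
        return F.mse_loss(scores, ratings)
    # Ranking: only the ordering of chosen vs. rejected scores matters.
    return -F.logsigmoid(scores - rejected_scores).mean()

# Ranking over a preference pair:
loss = reward_loss(torch.tensor([1.2]), rejected_scores=torch.tensor([0.3]))
# Regression against an absolute 1-5 rating:
loss = reward_loss(torch.tensor([3.8]), ratings=torch.tensor([4.0]), objective="regression")
```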