Capability
Variance Reduction in Policy Gradient Estimation via Baseline Subtraction
2 artifacts provide this capability.
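For orientation, subtracting any baseline b that does not depend on the action keeps the REINFORCE estimator unbiased while reducing its variance; the leave-one-out variant estimates b for each sample from the other samples in the batch. These are the standard identities, stated here for context rather than taken from either artifact:

```latex
% Baseline-subtracted policy gradient (unbiased for any b independent of a):
\nabla_\theta J(\theta) = \mathbb{E}\big[(R - b)\,\nabla_\theta \log \pi_\theta(a \mid s)\big]

% Leave-one-out baseline for sample i in a batch of n completions:
b_i = \frac{1}{n-1} \sum_{j \neq i} R_j
```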
via “REINFORCE Leave-One-Out (RLOO) policy gradient training”
Reinforcement learning from human feedback: SFT, DPO, and PPO trainers for LLM alignment.
Unique: Implements leave-one-out baseline estimation with automatic variance monitoring and adaptive learning-rate scaling, reducing gradient variance by 30-50% relative to standard REINFORCE without the overhead of a value function (see the sketch below).
vs others: Lower variance than standard REINFORCE because it uses batch-level baselines; simpler than PPO because it avoids value-head training and importance weighting; more efficient than GRPO at small batch sizes.
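A minimal sketch of the leave-one-out baseline described above, assuming a PyTorch setting; `rloo_advantages` is a hypothetical helper for illustration, not the artifact's actual API:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages: each sample's baseline is the mean
    reward of the *other* samples in the batch (illustrative helper,
    not the artifact's API).

    rewards: shape (n,), with n >= 2 completions for the same prompt.
    """
    n = rewards.numel()
    # Leave-one-out mean for every i at once: (sum - r_i) / (n - 1).
    baselines = (rewards.sum() - rewards) / (n - 1)
    return rewards - baselines

# Usage: scale log-prob gradients by the centered rewards.
rewards = torch.tensor([1.0, 0.0, 0.5, 0.2])
logprobs = torch.randn(4, requires_grad=True)  # stand-in for per-completion summed token log-probs
advantages = rloo_advantages(rewards)
loss = -(advantages.detach() * logprobs).mean()  # REINFORCE with LOO baseline
loss.backward()
```

Detaching the advantages keeps gradients flowing only through the log-probabilities, which is what makes this plain REINFORCE with a batch-level baseline rather than an actor-critic update.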