Capability
Retrospective Trajectory Optimization Via Policy Gradient Learning
8 artifacts provide this capability.
Top Matches
via “generative-reward-optimization-grpo-training”
Train transformer language models with reinforcement learning.
Unique: Implements unified reward-and-policy training in which the model generates both outputs and reward scores in a single forward pass, reducing pipeline complexity relative to RLHF while keeping explicit reward signals through a learned reward head.
vs others: More integrated than RLHF, since it eliminates separate reward-model training, and more explicit than DPO, since it keeps interpretable reward scores that can be inspected and debugged.
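The single-forward-pass idea described above can be sketched in miniature. This is an illustrative toy, not the artifact's implementation: a plain linear layer stands in for the transformer backbone, and all shapes and names (`W_backbone`, `W_policy`, `w_reward`, `d_model`, `vocab`) are hypothetical. The point it shows is the architecture: one shared representation feeds both a policy head (token logits) and a learned reward head (an inspectable scalar), and a REINFORCE-style update weights the policy gradient by the model's own reward, so no separate reward model is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: one shared linear "backbone"
# plus two heads. All sizes are illustrative.
d_model, vocab = 8, 5
W_backbone = rng.normal(scale=0.1, size=(d_model, d_model))
W_policy = rng.normal(scale=0.1, size=(d_model, vocab))  # policy head -> token logits
w_reward = rng.normal(scale=0.1, size=(d_model,))        # learned reward head -> scalar


def forward(x):
    """One forward pass yields BOTH policy probabilities and a reward score."""
    h = np.tanh(x @ W_backbone)        # shared representation
    logits = h @ W_policy              # policy head
    reward = float(h @ w_reward)       # reward head: an explicit, inspectable scalar
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, reward, h


x = rng.normal(size=d_model)           # stand-in for a prompt embedding
probs, reward, h = forward(x)

# REINFORCE-style policy-gradient step, weighted by the model's own
# reward head -- no separately trained reward model in the loop.
action = int(rng.choice(vocab, p=probs))
grad_logits = -probs
grad_logits[action] += 1.0             # d log pi(action|x) / d logits
W_policy += 0.01 * reward * np.outer(h, grad_logits)
```

Because the reward is an explicit scalar from a named head rather than an implicit preference signal (as in DPO), it can be logged and inspected per example, which is the debuggability trade-off the description highlights.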