Capability
Retrospective Trajectory Optimization Via Policy Gradient Learning
8 artifacts provide this capability.
Top Matches
via “generative-reward-optimization-grpo-training”
Train transformer language models with reinforcement learning.
Unique: Implements unified reward-and-policy training in which the model generates both outputs and reward scores in a single forward pass, reducing pipeline complexity relative to RLHF while keeping explicit reward signals through a learned reward head.
vs others: More integrated than RLHF, since it eliminates separate reward-model training, and more explicit than DPO, since it keeps interpretable reward scores that can be inspected and debugged.
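The single-forward-pass idea described above can be sketched in miniature. This is an illustrative toy, not the artifact's implementation: a plain linear layer stands in for the transformer backbone, and all shapes and names (`W_backbone`, `W_policy`, `w_reward`, `d_model`, `vocab`) are hypothetical. The point it shows is the architecture: one shared representation feeds both a policy head (token logits) and a learned reward head (an inspectable scalar), and a REINFORCE-style update weights the policy gradient by the model's own reward, so no separate reward model is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: one shared linear "backbone"
# plus two heads. All sizes are illustrative.
d_model, vocab = 8, 5
W_backbone = rng.normal(scale=0.1, size=(d_model, d_model))
W_policy = rng.normal(scale=0.1, size=(d_model, vocab))  # policy head -> token logits
w_reward = rng.normal(scale=0.1, size=(d_model,))        # learned reward head -> scalar


def forward(x):
    """One forward pass yields BOTH policy probabilities and a reward score."""
    h = np.tanh(x @ W_backbone)        # shared representation
    logits = h @ W_policy              # policy head
    reward = float(h @ w_reward)       # reward head: an explicit, inspectable scalar
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, reward, h


x = rng.normal(size=d_model)           # stand-in for a prompt embedding
probs, reward, h = forward(x)

# REINFORCE-style policy-gradient step, weighted by the model's own
# reward head -- no separately trained reward model in the loop.
action = int(rng.choice(vocab, p=probs))
grad_logits = -probs
grad_logits[action] += 1.0             # d log pi(action|x) / d logits
W_policy += 0.01 * reward * np.outer(h, grad_logits)
```

Because the reward is an explicit scalar from a named head rather than an implicit preference signal (as in DPO), it can be logged and inspected per example, which is the debuggability trade-off the description highlights.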