Capability
Async GRPO with Decoupled Generation and Training
2 artifacts provide this capability.
Reinforcement learning from human feedback: SFT, DPO, and PPO trainers for LLM alignment.
Unique: queue-based async architecture with automatic load balancing and staleness monitoring, enabling a 2-3x throughput improvement over synchronous GRPO while maintaining training stability through careful policy synchronization.
vs others: higher throughput than synchronous GRPO because generation and training run in parallel; more stable than naive async RL because it monitors policy staleness and dynamically adjusts queue sizes.
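The decoupling described above can be illustrated with a minimal sketch: a generator thread fills a bounded queue with rollouts tagged by the policy version that produced them, while the trainer consumes from the queue and skips rollouts whose staleness exceeds a threshold. The queue size, staleness threshold, and thread-based structure here are illustrative assumptions, not the actual API of either artifact.

```python
import queue
import threading

MAX_STALENESS = 2   # assumed threshold: skip rollouts older than 2 policy versions
QUEUE_SIZE = 8      # bounded queue decouples generation from training

rollout_queue = queue.Queue(maxsize=QUEUE_SIZE)
policy_version = 0
stop = threading.Event()

def generator():
    """Produce rollouts tagged with the policy version they were sampled from."""
    while not stop.is_set():
        rollout = {"version": policy_version, "data": "trajectory"}  # placeholder rollout
        try:
            rollout_queue.put(rollout, timeout=0.1)
        except queue.Full:
            continue  # trainer is behind: back off instead of letting staleness grow

def trainer(num_steps):
    """Consume rollouts, skipping any that are too stale to train on safely."""
    global policy_version
    for _ in range(num_steps):
        rollout = rollout_queue.get()
        staleness = policy_version - rollout["version"]
        if staleness > MAX_STALENESS:
            continue  # too stale: skip rather than destabilize training
        # ... gradient step on the rollout would go here ...
        policy_version += 1  # new policy snapshot after each update

gen_thread = threading.Thread(target=generator, daemon=True)
gen_thread.start()
trainer(num_steps=20)
stop.set()
print("final policy version:", policy_version)
```

Because generation never blocks on the gradient step (and vice versa), both stages stay busy, which is where the throughput gain over synchronous GRPO comes from; the staleness check is what keeps the off-policy gap bounded.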