Experience Replay Buffer With Prioritized Sampling For Off-Policy Learning

1

Human-level control through deep reinforcement learning (Deep Q Network) · Product · 22/100

via “experience replay buffer with prioritized sampling for off-policy learning”

* 🏆 2015: [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (Faster R-CNN)](https://papers.nips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html)

Unique: Makes experience replay a core stabilization mechanism for deep Q-learning: the agent learns off-policy from minibatches drawn from a replay buffer rather than from the stream of consecutive transitions. This decouples data collection from learning, so the same transition can be reused in many updates, each evaluated against the current target network. The original DQN samples the buffer uniformly; prioritized sampling is a later extension.

vs others: Improves sample efficiency over on-policy methods (e.g., policy gradient), by a claimed 5-10x, and reduces update variance by breaking temporal correlations between consecutive samples, at the cost of extra memory for the buffer and potential off-policy bias.
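The prioritized sampling named in the match label can be sketched as follows. This is a minimal illustration of proportional prioritization with importance-sampling weights, not the DQN paper's own (uniform) sampler. The class name `PrioritizedReplayBuffer`, the parameters `alpha`, `beta`, and `eps`, and the list-based storage are assumptions made for the example; a practical version would usually use a sum-tree instead of recomputing probabilities on every call.

```python
import random


class PrioritizedReplayBuffer:
    """Ring buffer with proportional prioritized sampling (illustrative sketch)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha      # 0 = uniform sampling, 1 = fully proportional
        self.eps = eps          # keeps every priority strictly positive
        self.data = []          # stored transitions
        self.priorities = []    # one priority per stored transition
        self.pos = 0            # next slot to overwrite once the buffer is full

    def add(self, transition):
        # New transitions get the current maximum priority so they are
        # likely to be sampled soon after insertion.
        p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4, rng=random):
        # Sampling probability is proportional to priority ** alpha.
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idxs = rng.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance-sampling weights offset the bias from non-uniform
        # sampling; normalizing by the batch maximum is a simplification.
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idxs, [self.data[i] for i in idxs], weights

    def update_priorities(self, idxs, td_errors):
        # After a learning step, priority tracks the absolute TD error.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a training loop, the returned weights would multiply each sample's loss, and `update_priorities` would be called with the TD errors computed for the sampled indices.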

2

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization (Retroformer) · Product · 19/100

via “trajectory replay and batch policy gradient estimation”

Unique: Implements trajectory replay as a first-class learning mechanism, so the agent can learn from stored trajectories without further environment interaction, unlike online RL agents that require continuous interaction.

vs others: More sample-efficient than online RL because each trajectory is reused across multiple updates, and more stable than single-trajectory updates because averaging over a batch reduces gradient variance.
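Trajectory replay with batch-averaged policy gradients can be sketched on a toy problem. The example below is a generic tabular REINFORCE estimator, not Retroformer's language-model implementation. The function names, the `(state, action, reward)` trajectory format, and the per-state logit table are all assumptions for illustration. It also omits the importance-weight correction that reusing trajectories from an older policy would strictly require.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def batch_policy_gradient(logits, trajectories, gamma=1.0):
    """Average REINFORCE gradient w.r.t. tabular logits over a batch."""
    grad = [[0.0] * len(row) for row in logits]
    for traj in trajectories:
        ret = 0.0
        # Walk backwards to accumulate the discounted return-to-go.
        for state, action, reward in reversed(traj):
            ret = reward + gamma * ret
            probs = softmax(logits[state])
            for a, p in enumerate(probs):
                # d log pi(action|state) / d logit[a] = 1[a == action] - pi(a|state)
                grad[state][a] += ret * ((1.0 if a == action else 0.0) - p)
    n = len(trajectories)
    return [[g / n for g in row] for row in grad]


def replay_update(logits, buffer, batch_size, lr=0.1, rng=random):
    """One gradient-ascent step on a batch of stored trajectories."""
    batch = rng.sample(buffer, min(batch_size, len(buffer)))
    grad = batch_policy_gradient(logits, batch)
    return [[w + lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(logits, grad)]
```

Calling `replay_update` repeatedly on the same buffer reuses each stored trajectory across several updates, and dividing by the batch size is what averages out per-trajectory noise.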

3

Efficient Online Reinforcement Learning with Offline Data (RLPD) · Product · 18/100

via “offline-online hybrid reinforcement learning with replay buffer fusion”

* ⏫ 03/2023: [Reward Design with Language Models](https://arxiv.org/abs/2303.00001)

Unique: RLPD adds offline data to a standard off-policy online learner with a small set of changes: symmetric sampling, which draws each batch half from the offline dataset and half from the online replay buffer; layer normalization in the critic to limit value over-extrapolation; and a large critic ensemble trained at a high update-to-data ratio. This contrasts with offline-RL methods (CQL, IQL), which rely on conservatism or in-sample constraints to stay close to the offline data.

vs others: More sample-efficient than pure online RL (SAC, PPO) when offline data exists, and less restrictive than purely offline methods (CQL) because it keeps improving through online interaction without requiring manual tuning of conservatism levels.
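The replay buffer fusion named in the match label can be sketched as a sampler that builds each batch from two sources. The function name `sample_mixed_batch`, the `offline_fraction` parameter, and the fixed 50/50 default are assumptions for illustration, not a reproduction of the RLPD codebase.

```python
import random


def sample_mixed_batch(offline_data, online_buffer, batch_size,
                       offline_fraction=0.5, rng=random):
    """Draw one training batch from an offline dataset and an online buffer."""
    n_offline = int(batch_size * offline_fraction)
    if not online_buffer:
        # Before any online data is collected, fall back to offline data only.
        n_offline = batch_size
    n_online = batch_size - n_offline

    # Sample with replacement from each source independently.
    batch = rng.choices(offline_data, k=n_offline)
    if n_online:
        batch += rng.choices(online_buffer, k=n_online)

    # Shuffle so that downstream code cannot rely on source ordering.
    rng.shuffle(batch)
    return batch
```

Keeping the two sources in separate containers means the offline dataset is never evicted as the online buffer fills, and the mixing ratio stays constant regardless of how much online data has been collected.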

4

Dataloop · Product

via “active learning sample prioritization”
