RL from human preferences reproduction
This repository provides a reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences" paper, enabling users to train agents using human feedback. It targets researchers and practitioners interested in preference-based RL, offering a practical implementation for environments like Pong and Enduro.
How It Works
The project employs an asynchronous architecture with three main components: A2C workers for environment interaction and policy training, a preference interface for collecting human feedback on agent behavior clips, and a reward predictor network. Video clips generated by A2C workers are queued and presented in pairs by the preference interface. Human preferences are then fed to the reward predictor, which trains a neural network to estimate reward signals from agent behavior. These predicted rewards are used to train the A2C workers, creating a closed loop for preference-based learning.
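At the heart of this loop is the preference model from the original paper: the probability that a human prefers one clip over the other is modeled as the softmax of the two clips' summed predicted rewards (a Bradley-Terry model), and the predictor is trained with cross-entropy against the human labels. Below is a minimal NumPy sketch of that loss, not code from this repository (which implements the predictor as a TensorFlow 1.x network); the function names and the `mu` label convention are illustrative.

```python
import numpy as np

def preference_probability(r1, r2):
    """Bradley-Terry model: P(clip 1 preferred) is the softmax of the
    two clips' summed per-frame predicted rewards."""
    s1, s2 = np.sum(r1), np.sum(r2)
    m = max(s1, s2)                      # shift for numerical stability
    e1, e2 = np.exp(s1 - m), np.exp(s2 - m)
    return e1 / (e1 + e2)

def preference_loss(r1, r2, mu, eps=1e-8):
    """Cross-entropy against the human label mu:
    1.0 = clip 1 preferred, 0.0 = clip 2 preferred, 0.5 = no preference."""
    p = preference_probability(r1, r2)
    return -(mu * np.log(p + eps) + (1.0 - mu) * np.log(1.0 - p + eps))
```

Minimizing this loss over the stored labeled clip pairs is what turns sparse human comparisons into a dense per-frame reward signal for the A2C workers.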
Quick Start & Requirements
```
pipenv install
pipenv run pip install tensorflow==1.15   # or tensorflow-gpu==1.15
pipenv install --dev
pipenv shell
python3 run.py <mode> <environment>
```

Supported environments: MovingDotNoFrameskip-v0, PongNoFrameskip-v4, and EnduroNoFrameskip-v4.
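For example, a preference-based training run on Pong might be launched with `python3 run.py train_policy_with_preferences PongNoFrameskip-v4`; the mode name here is illustrative, so check the repository's usage documentation (or `python3 run.py --help`) for the modes this reproduction actually exposes.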
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats