learning-from-human-preferences by mrahtz

RL from human preferences reproduction

created 7 years ago
325 stars

Top 85.0% on sourcepulse

View on GitHub
Project Summary

This repository provides a reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences" paper, enabling users to train agents using human feedback. It targets researchers and practitioners interested in preference-based RL, offering a practical implementation for environments like Pong and Enduro.

How It Works

The project employs an asynchronous architecture with three main components: A2C workers for environment interaction and policy training, a preference interface for collecting human feedback on clips of agent behavior, and a reward predictor network. Video clips generated by the A2C workers are queued and presented in pairs by the preference interface. The resulting human preferences are then used to train the reward predictor network to estimate a reward signal from agent behavior. These predicted rewards in turn train the A2C workers, closing the loop for preference-based learning.
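Concretely, the reward predictor is trained with the pairwise comparison loss from the underlying paper (Christiano et al., 2017): predicted per-step rewards are summed over each clip, and a softmax over the two sums gives the probability that the first clip is preferred. Below is a minimal sketch of that loss, not the repository's exact code; the array arguments and the label mu are illustrative names.

```python
import numpy as np

def preference_loss(rhat_clip1, rhat_clip2, mu):
    """Cross-entropy between the predicted and the human preference.

    rhat_clip1, rhat_clip2: NumPy arrays of per-step predicted rewards.
    mu: human label; 1.0 if clip 1 is preferred, 0.0 if clip 2, 0.5 for a tie.
    """
    # Sum predicted rewards over each clip; the softmax over the two sums
    # is the Bradley-Terry probability that clip 1 is preferred.
    logits = np.array([rhat_clip1.sum(), rhat_clip2.sum()])
    log_p = logits - np.logaddexp(logits[0], logits[1])
    # Cross-entropy against the (possibly soft) human label.
    return -(mu * log_p[0] + (1.0 - mu) * log_p[1])
```

In the paper's scheme, this loss is minimized over a database of labeled clip pairs, so the predicted rewards come to explain the human's choices.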

Quick Start & Requirements

  • Install dependencies using pipenv install.
  • Manually install TensorFlow 1.x: pipenv run pip install tensorflow==1.15 or tensorflow-gpu==1.15.
  • Python 3.7 or below is required, since TensorFlow 1.x does not support newer Python versions.
  • For testing: pipenv install --dev.
  • Enter the environment: pipenv shell.
  • Run training: python3 run.py <mode> <environment>.
  • Supported environments: MovingDotNoFrameskip-v0, PongNoFrameskip-v4, EnduroNoFrameskip-v4.
  • Results from the author's runs: https://www.floydhub.com/mrahtz/projects/learning-from-human-preferences

Highlighted Details

  • Successfully trained agents for MovingDot, Pong, and Enduro using both synthetic and human preferences (a synthetic-preference oracle is sketched after this list).
  • Implemented a command-line interface for collecting human preferences on agent behavior clips.
  • Utilizes asynchronous subprocesses for A2C workers, preference interface, and reward predictor training.
  • Allows training stages to be run piece by piece (e.g. gathering preferences, pretraining the reward predictor).
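With synthetic preferences, the human is replaced by an oracle that prefers whichever clip has the higher true environment return, the same validation trick used in the paper. A minimal sketch, assuming each clip carries its true per-step rewards (the function and argument names are illustrative):

```python
def synthetic_preference(clip1_rewards, clip2_rewards):
    """Oracle label mu for 'clip 1 preferred': 1.0, 0.0, or 0.5 for a tie."""
    r1, r2 = sum(clip1_rewards), sum(clip2_rewards)
    if r1 > r2:
        return 1.0  # clip 1 has the higher true return
    if r1 < r2:
        return 0.0  # clip 2 has the higher true return
    return 0.5      # tie: split the preference
```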

Maintenance & Community

  • A2C code is based on OpenAI baselines (commit f8663ea).
  • Discussions on implementation details and potential improvements are available via GitHub issues.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README.

Limitations & Caveats

  • Requires TensorFlow 1.x, which is outdated and has limited support.
  • Some features from the original paper, such as adaptive L2 regularization and ensemble-based clip selection, are simplified or omitted.
  • The original paper models the labeler as answering uniformly at random 10% of the time; this implementation handles that noise model differently (the paper's version is sketched below).
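For reference, the paper's noise model simply blends the Bradley-Terry probability with a uniform response; a one-line sketch (the function name is illustrative, and this is not the repository's code):

```python
def noisy_preference_prob(p_clip1, epsilon=0.1):
    """Soften the predicted preference probability toward 0.5, modeling a
    labeler who answers uniformly at random with probability epsilon
    (10% in the original paper)."""
    return (1.0 - epsilon) * p_clip1 + epsilon * 0.5
```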

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
