learning-from-human-preferences by mrahtz

RL from human preferences reproduction

created 7 years ago
325 stars

Top 85.0% on sourcepulse

View on GitHub
Project Summary

This repository provides a reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences" paper, enabling users to train agents using human feedback. It targets researchers and practitioners interested in preference-based RL, offering a practical implementation for environments like Pong and Enduro.

How It Works

The project employs an asynchronous architecture with three main components: A2C workers for environment interaction and policy training, a preference interface for collecting human feedback on clips of agent behavior, and a reward predictor network. Video clips generated by the A2C workers are queued and presented in pairs by the preference interface. The resulting human preferences are then used to train the reward predictor network to estimate a reward signal from agent behavior. These predicted rewards in turn train the A2C workers, closing the loop for preference-based learning.
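Concretely, the reward predictor is trained with the pairwise comparison loss from the underlying paper (Christiano et al., 2017): predicted per-step rewards are summed over each clip, and a softmax over the two sums gives the probability that the first clip is preferred. Below is a minimal sketch of that loss, not the repository's exact code; the array arguments and the label mu are illustrative names.

```python
import numpy as np

def preference_loss(rhat_clip1, rhat_clip2, mu):
    """Cross-entropy between the predicted and the human preference.

    rhat_clip1, rhat_clip2: NumPy arrays of per-step predicted rewards.
    mu: human label; 1.0 if clip 1 is preferred, 0.0 if clip 2, 0.5 for a tie.
    """
    # Sum predicted rewards over each clip; the softmax over the two sums
    # is the Bradley-Terry probability that clip 1 is preferred.
    logits = np.array([rhat_clip1.sum(), rhat_clip2.sum()])
    log_p = logits - np.logaddexp(logits[0], logits[1])
    # Cross-entropy against the (possibly soft) human label.
    return -(mu * log_p[0] + (1.0 - mu) * log_p[1])
```

In the paper's scheme, this loss is minimized over a database of labeled clip pairs, so the predicted rewards come to explain the human's choices.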

Quick Start & Requirements

  • Install dependencies using pipenv install.
  • Manually install TensorFlow 1.x: pipenv run pip install tensorflow==1.15 or tensorflow-gpu==1.15.
  • Python 3.7 or below is required, since TensorFlow 1.x does not support newer Python versions.
  • For testing: pipenv install --dev.
  • Enter the environment: pipenv shell.
  • Run training: python3 run.py <mode> <environment>.
  • Supported environments: MovingDotNoFrameskip-v0, PongNoFrameskip-v4, EnduroNoFrameskip-v4.
  • Results from the author's runs: https://www.floydhub.com/mrahtz/projects/learning-from-human-preferences

Highlighted Details

  • Successfully trained agents for MovingDot, Pong, and Enduro using both synthetic and human preferences (a synthetic-preference oracle is sketched after this list).
  • Implemented a command-line interface for collecting human preferences on agent behavior clips.
  • Utilizes asynchronous subprocesses for A2C workers, preference interface, and reward predictor training.
  • Allows training stages to be run piece by piece (e.g. gathering preferences, pretraining the reward predictor).
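With synthetic preferences, the human is replaced by an oracle that prefers whichever clip has the higher true environment return, the same validation trick used in the paper. A minimal sketch, assuming each clip carries its true per-step rewards (the function and argument names are illustrative):

```python
def synthetic_preference(clip1_rewards, clip2_rewards):
    """Oracle label mu for 'clip 1 preferred': 1.0, 0.0, or 0.5 for a tie."""
    r1, r2 = sum(clip1_rewards), sum(clip2_rewards)
    if r1 > r2:
        return 1.0  # clip 1 has the higher true return
    if r1 < r2:
        return 0.0  # clip 2 has the higher true return
    return 0.5      # tie: split the preference
```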

Maintenance & Community

  • A2C code is based on OpenAI baselines (commit f8663ea).
  • Discussions on implementation details and potential improvements are available via GitHub issues.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README.

Limitations & Caveats

  • Requires TensorFlow 1.x, which is outdated and has limited support.
  • Some features from the original paper, such as adaptive L2 regularization and ensemble-based clip selection, are simplified or omitted.
  • The original paper models the labeler as answering uniformly at random 10% of the time; this implementation handles that noise model differently (the paper's version is sketched below).
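For reference, the paper's noise model simply blends the Bradley-Terry probability with a uniform response; a one-line sketch (the function name is illustrative, and this is not the repository's code):

```python
def noisy_preference_prob(p_clip1, epsilon=0.1):
    """Soften the predicted preference probability toward 0.5, modeling a
    labeler who answers uniformly at random with probability epsilon
    (10% in the original paper)."""
    return (1.0 - epsilon) * p_clip1 + epsilon * 0.5
```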

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
