avataRL by tokenbender

Train language models from scratch using pure reinforcement learning

Created 3 months ago
268 stars

Top 95.7% on SourcePulse

View on GitHub
Project Summary

This repository provides an implementation for training language models from scratch using pure reinforcement learning (RL), aiming to optimize for generalization over memorization. It targets researchers and practitioners interested in alternative LLM training paradigms beyond the traditional pretrain → SFT → RL pipeline. The core benefit is exploring an RL-centric approach to LLM development.

How It Works

The project implements a GPT-2 architecture trained exclusively with RL, using a "referee model" approach: a referee trained on ground-truth data scores the predictions of the main language model, and that score provides the reward signal. The approach converges with reasonable compute, and the referee does not need to be larger than the model being trained.
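
To make the idea concrete, here is a minimal, hypothetical sketch of referee-scored RL in PyTorch: a policy model samples a next token, a frozen referee (assumed pre-trained on ground truth) assigns that token a probability which is used as the reward, and a REINFORCE-style update follows. The model definitions, reward shaping, and names below are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup (illustrative only): `policy` is the model trained from scratch,
# `referee` stands in for a model already trained on ground-truth data.
vocab_size, d_model = 256, 64
policy = nn.Sequential(nn.Embedding(vocab_size, d_model),
                       nn.Linear(d_model, vocab_size))
referee = nn.Sequential(nn.Embedding(vocab_size, d_model),
                        nn.Linear(d_model, vocab_size))
referee.eval()  # assumed pre-trained and kept frozen here

optimizer = torch.optim.AdamW(policy.parameters(), lr=3e-4)

def rl_step(context: torch.Tensor):
    """One REINFORCE-style update on a batch of single-token contexts."""
    logits = policy(context)                      # (batch, vocab)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                        # policy's predicted next token

    with torch.no_grad():
        # The referee's probability of the sampled token acts as the reward.
        ref_probs = F.softmax(referee(context), dim=-1)
        reward = ref_probs.gather(1, action.unsqueeze(1)).squeeze(1)
        # Mean-reward baseline reduces gradient variance.
        advantage = reward - reward.mean()

    loss = -(dist.log_prob(action) * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()

# Usage: feed batches of token contexts; only the referee's score drives learning.
loss, avg_reward = rl_step(torch.randint(0, vocab_size, (32,)))
```

Because only relative scores matter for the policy-gradient update, a referee of comparable or even smaller size than the trained model can still provide a usable learning signal.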

Quick Start & Requirements

  • Local Training: bash start.sh (sets up the environment and downloads data/models), then python avatarl.py (single GPU) or torchrun --nproc_per_node=8 avatarl.py (multi-GPU).
  • Modal Cloud Training: pip install modal, modal setup, then modal run modal_train.py:train_avatarl_single_node.
  • Prerequisites: Python 3.12, PyTorch 2.6.0, a CUDA 12.6+ capable GPU (H100/A100 recommended), and Flash Attention. A Modal account and CLI are required for cloud training.
  • Links: avatarl.md

Highlighted Details

  • Explores training LLMs from scratch with pure RL, bypassing traditional pretraining.
  • Utilizes a "referee model" for scoring predictions, enabling efficient RL training.
  • Codebase cleaned and refactored for improved performance and clarity.
  • Includes scripts for local and Modal Cloud distributed training and evaluation.

Maintenance & Community

The project is primarily driven by "tokenbender." Contributions are welcome via pull requests.

Licensing & Compatibility

Licensed under the Apache 2.0 license. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Early development stages involved significant iteration and debugging, with some experimental approaches proving inefficient or unstable. The project's success relies on the effectiveness of the referee model and the RL reward shaping.

Health Check

  • Last Commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 17 stars in the last 30 days

Explore Similar Projects

Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 19 more.

trlx by CarperAI

  • Distributed RLHF for LLMs
  • 5k stars
  • Created 3 years ago
  • Updated 1 year ago