avataRL by tokenbender

Train language models from scratch using pure reinforcement learning

Created 3 months ago
268 stars

Top 95.7% on SourcePulse

View on GitHub
Project Summary

This repository provides an implementation for training language models from scratch using pure reinforcement learning (RL), aiming to optimize for generalization over memorization. It targets researchers and practitioners interested in alternative LLM training paradigms beyond the traditional pretrain → SFT → RL pipeline. The core benefit is exploring an RL-centric approach to LLM development.

How It Works

The project implements a GPT-2 architecture trained exclusively with RL, using a "referee model" approach: a referee trained on ground-truth data scores the predictions of the main language model, and that score provides the reward signal. The approach converges with reasonable compute, and the referee does not need to be larger than the model being trained.
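
To make the idea concrete, here is a minimal, hypothetical sketch of referee-scored RL in PyTorch: a policy model samples a next token, a frozen referee (assumed pre-trained on ground truth) assigns that token a probability which is used as the reward, and a REINFORCE-style update follows. The model definitions, reward shaping, and names below are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup (illustrative only): `policy` is the model trained from scratch,
# `referee` stands in for a model already trained on ground-truth data.
vocab_size, d_model = 256, 64
policy = nn.Sequential(nn.Embedding(vocab_size, d_model),
                       nn.Linear(d_model, vocab_size))
referee = nn.Sequential(nn.Embedding(vocab_size, d_model),
                        nn.Linear(d_model, vocab_size))
referee.eval()  # assumed pre-trained and kept frozen here

optimizer = torch.optim.AdamW(policy.parameters(), lr=3e-4)

def rl_step(context: torch.Tensor):
    """One REINFORCE-style update on a batch of single-token contexts."""
    logits = policy(context)                      # (batch, vocab)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                        # policy's predicted next token

    with torch.no_grad():
        # The referee's probability of the sampled token acts as the reward.
        ref_probs = F.softmax(referee(context), dim=-1)
        reward = ref_probs.gather(1, action.unsqueeze(1)).squeeze(1)
        # Mean-reward baseline reduces gradient variance.
        advantage = reward - reward.mean()

    loss = -(dist.log_prob(action) * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()

# Usage: feed batches of token contexts; only the referee's score drives learning.
loss, avg_reward = rl_step(torch.randint(0, vocab_size, (32,)))
```

Because only relative scores matter for the policy-gradient update, a referee of comparable or even smaller size than the trained model can still provide a usable learning signal.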

Quick Start & Requirements

  • Local Training: bash start.sh (sets up the environment and downloads data/models), then python avatarl.py (single GPU) or torchrun --nproc_per_node=8 avatarl.py (multi-GPU).
  • Modal Cloud Training: pip install modal, modal setup, then modal run modal_train.py:train_avatarl_single_node.
  • Prerequisites: Python 3.12, PyTorch 2.6.0, a CUDA 12.6+ capable GPU (H100/A100 recommended), and Flash Attention. A Modal account and CLI are required for cloud training.
  • Links: avatarl.md

Highlighted Details

  • Explores training LLMs from scratch with pure RL, bypassing traditional pretraining.
  • Utilizes a "referee model" for scoring predictions, enabling efficient RL training.
  • Codebase cleaned and refactored for improved performance and clarity.
  • Includes scripts for local and Modal Cloud distributed training and evaluation.

Maintenance & Community

The project is primarily driven by "tokenbender." Contributions are welcome via pull requests.

Licensing & Compatibility

Licensed under the Apache 2.0 license. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Early development stages involved significant iteration and debugging, with some experimental approaches proving inefficient or unstable. The project's success relies on the effectiveness of the referee model and the RL reward shaping.

Health Check

  • Last Commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 17 stars in the last 30 days

Explore Similar Projects

Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 19 more.

trlx by CarperAI

  • Distributed RLHF for LLMs
  • 5k stars
  • Created 3 years ago
  • Updated 1 year ago