nano-aha-moment by McGill-NLP

Single-file library for "RL for LLMs" training

Created 6 months ago
528 stars

Top 60.0% on SourcePulse

View on GitHub
Project Summary

This library provides a single-file, single-GPU implementation of reinforcement learning (RL) training for Large Language Models (LLMs), targeting researchers and practitioners who want to understand and experiment with this style of training from scratch. It offers an efficient, full-parameter tuning approach, making otherwise complex RL training for LLMs accessible and understandable.

How It Works

The library implements a DeepSeek R1-Zero style training pipeline, focusing on simplicity and efficiency. It avoids external RL libraries, integrating all necessary components within a single Jupyter notebook or Python script. This design gives complete visibility into the RL training process, from data handling to model fine-tuning, and makes rapid iteration and debugging straightforward.
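DeepSeek R1-Zero style training is typically built around a group-relative (GRPO-style) objective: for each prompt, several completions are sampled, scored with a rule-based reward, and each completion's advantage is its reward normalized against the other completions in the same group. The sketch below illustrates only that advantage step under those assumptions; the function name and shapes are illustrative and not taken from the repository.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative GRPO-style advantage computation (not the repository's code).

    rewards: shape (num_prompts, group_size), one rule-based score per sampled completion.
    Each completion is baselined against the other completions for the same prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_normalized_advantages(rewards))
```

Because the baseline comes from the sampled group itself, no separate value network is required, which is part of what keeps a single-file, single-GPU implementation feasible.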

Quick Start & Requirements

  • Install dependencies: pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124 and pip install -r requirements.txt.
  • Requires CUDA 12.4.
  • Training can be initiated by running nano_r1.ipynb or nano_r1_script.py.
  • Official model: McGill-NLP/nano-aha-moment-3b (a loading sketch follows this list)
  • Lecture Series: Part 1, Part 2
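If the checkpoint listed above is hosted on the Hugging Face Hub under that ID, it should load like any other causal LM. The snippet below is a hypothetical usage sketch, not documented repository API; the prompt is invented for illustration.

```python
# Hypothetical usage sketch (assumes the checkpoint is on the Hugging Face Hub
# as "McGill-NLP/nano-aha-moment-3b" and that transformers and accelerate are installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "McGill-NLP/nano-aha-moment-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Invented Countdown-style prompt, for illustration only.
prompt = "Using the numbers [55, 36, 19, 27], create an equation that equals 65."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```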

Highlighted Details

  • Single-file, single-GPU implementation of RL training for LLMs.
  • Full-parameter tuning with efficient training (under 10 hours).
  • Inspired by TinyZero and Mini-R1, with a focus on simplicity and clarity.
  • Achieves ~60% accuracy on the Countdown task with the trained 3B model (a rule-based reward check is sketched below).
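The Countdown task asks the model to combine a given set of numbers with basic arithmetic into an expression that hits a target value, so correctness can be verified mechanically rather than by a learned reward model. The check below is a minimal illustrative sketch; the `<answer>` tag format and the all-or-nothing scoring are assumptions, not the repository's exact reward function.

```python
import re

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    """Illustrative rule-based reward: 1.0 if the proposed equation uses exactly
    the given numbers and evaluates to the target, else 0.0 (assumed format)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Allow only digits, basic operators, parentheses, and whitespace.
    if not re.fullmatch(r"[\d+\-*/() .]+", equation):
        return 0.0
    # Each provided number must be used exactly once.
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        value = eval(equation, {"__builtins__": {}}, {})
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# Example check: (55 - 36) + 19 + 27 = 65
print(countdown_reward("<answer>(55 - 36) + 19 + 27</answer>", [55, 36, 19, 27], 65))  # 1.0
```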

Maintenance & Community

The project is associated with McGill University's NLP group. Further community engagement details (e.g., Discord/Slack) are not specified in the README.

Licensing & Compatibility

The repository does not state a license. In the absence of one, the code defaults to all rights reserved, which may restrict commercial use or integration into closed-source projects.

Limitations & Caveats

The README lists a "Full evaluation suite" as a to-do item, indicating that comprehensive benchmarking and validation may be incomplete. The absence of a specified license is a significant caveat for adoption in commercial or closed-source environments.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 15 stars in the last 30 days
