DeepSeekRL-Extended by brendanhogan

GRPO implementation for scaled RL research

Created 1 year ago

252 stars

Top 99.6% on SourcePulse

2 Experts Love This Project

Wing Lian

Founder of Axolotl AI

Will Brown

Research Lead at Prime Intellect

Project Summary

This project provides a from-scratch implementation of Group Relative Policy Optimization (GRPO) for language models, demonstrated by training Qwen1.5B on the GSM8K grade-school math dataset. It targets researchers and engineers who want to understand and experiment with core RL mechanics without relying on complex external libraries. The key benefit is a simplified, modular codebase designed for learning and experimentation, and potentially for scaling down complex RL research.

How It Works

The defining design choice is to compute the GRPO loss directly in the codebase rather than delegating it to an external RL library, which keeps the computation transparent and easy to inspect. The code is split into separate Python scripts: main.py orchestrates the training loop, llms.py handles model loading (currently supporting LLaMA and Qwen via Hugging Face Transformers), rldatasets.py manages dataset loading and preprocessing (GSM8K), and evaluator.py implements reward functions and metrics mirroring DeepSeek's original setup.
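To make the idea of a directly computed GRPO loss concrete, here is a minimal standard-library sketch. It is not the repository's code: the function names, the population standard deviation, the clipping range, the KL coefficient, and the choice of KL estimator are all assumptions made for illustration. The group-relative step scores each completion against the other completions sampled for the same prompt, and the loss combines a clipped policy-ratio term with a KL penalty toward a reference model.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation; eps avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_loss(new_logps, old_logps, ref_logps, advantages,
              clip_eps=0.2, beta=0.04):
    """Negated GRPO-style objective for one group (hypothetical helper).

    Each *_logps argument is a list with one entry per completion, and each
    entry is a list of per-token log-probabilities under the current policy,
    the sampling policy, and the frozen reference policy respectively.
    """
    per_completion = []
    for new, old, ref, adv in zip(new_logps, old_logps, ref_logps, advantages):
        token_terms = []
        for lp_new, lp_old, lp_ref in zip(new, old, ref):
            # Probability ratio between the current and the sampling policy.
            ratio = math.exp(lp_new - lp_old)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            surrogate = min(ratio * adv, clipped * adv)
            # Non-negative KL estimate: pi_ref/pi - log(pi_ref/pi) - 1.
            d = lp_ref - lp_new
            kl = math.exp(d) - d - 1.0
            token_terms.append(surrogate - beta * kl)
        # Average over tokens first, then over completions in the group.
        per_completion.append(mean(token_terms))
    return -mean(per_completion)


if __name__ == "__main__":
    rewards = [2.5, 0.5, 0.0, 2.5]  # one reward per sampled completion
    print(group_relative_advantages(rewards))
```

In a real training loop the same arithmetic would run on tensors with automatic differentiation; the list-based version is only meant to show which quantities enter the loss.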

Quick Start & Requirements

Primary install command: pip install -r requirements.txt
Prerequisites: Requires a Hugging Face token, configurable via environment variable (export HUGGINGFACE_TOKEN="your-token-here") or by running huggingface-cli login.
Hardware: Training was conducted on a single NVIDIA H100 GPU.
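As a small illustration of the token prerequisite, the sketch below reads the HUGGINGFACE_TOKEN environment variable named above and falls back to None when it is unset. The helper name resolve_hf_token is hypothetical and not part of the project; the assumption is that a None result means the script relies on credentials stored by huggingface-cli login instead.

```python
import os


def resolve_hf_token(env=None):
    """Return the token from HUGGINGFACE_TOKEN, or None if it is unset or blank.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("HUGGINGFACE_TOKEN is not set; assuming `huggingface-cli login` was run.")
    else:
        print("Using the token from HUGGINGFACE_TOKEN.")
```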

Highlighted Details

Direct GRPO loss calculation for educational clarity.
Modular, multi-script design tailored for learning and experimentation.
Demonstrates training Qwen1.5B on the GSM8K dataset.
Codebase restructured for easier understanding compared to original implementations.
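For readers unfamiliar with how GSM8K completions can be scored, here is a hedged sketch of a rule-based reward. It is not taken from evaluator.py. The only dataset fact it relies on is that GSM8K reference solutions end with a line of the form "#### <number>"; the <answer>...</answer> tag convention, the function names, and the reward magnitudes are assumptions chosen for illustration.

```python
import re


def extract_gold_answer(gsm8k_answer: str) -> str:
    """Take the final number after '####' in a GSM8K reference solution."""
    return gsm8k_answer.split("####")[-1].strip().replace(",", "")


def extract_model_answer(completion: str):
    """Return the text inside <answer>...</answer>, or None if the tags are absent.

    Assumption: the model is prompted to wrap its final answer in these tags.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip().replace(",", "") if match else None


def score_completion(completion: str, gsm8k_answer: str) -> dict:
    """Combine a small format reward with a larger correctness reward.

    The weights 0.5 and 2.0 are illustrative, not the project's values.
    """
    predicted = extract_model_answer(completion)
    format_reward = 0.5 if predicted is not None else 0.0
    gold = extract_gold_answer(gsm8k_answer)
    correctness_reward = 2.0 if predicted == gold else 0.0
    return {
        "format": format_reward,
        "correctness": correctness_reward,
        "total": format_reward + correctness_reward,
    }


if __name__ == "__main__":
    reference = "48 / 2 = 24 clips in May. 48 + 24 = 72 clips in total.\n#### 72"
    completion = "<reasoning>48 + 24 = 72</reasoning>\n<answer>72</answer>"
    print(score_completion(completion, reference))
```

Rewards of this kind, computed for every completion in a group, are what a GRPO training step would normalize into advantages.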

Maintenance & Community

The README does not specify maintainers, community channels (like Discord or Slack), or a public roadmap.

Licensing & Compatibility

The license under which this project is distributed is not mentioned in the provided README.

Limitations & Caveats

The current implementation focuses on smaller-scale learning and experimentation. The README's future directions (adding self-play, implementing soft reward structures, and expanding to vision-language models) are described as requiring faster execution and multi-GPU training support, which suggests that none of these capabilities is available yet.

Health Check

Last Commit

6 months ago

Responsiveness

1 day

Star History

1 star in the last 30 days