DeepSeekRL-Extended  by brendanhogan

GRPO implementation for scaled RL research

Created 11 months ago
251 stars

Top 99.9% on SourcePulse

Project Summary

This project provides a from-scratch implementation of Group Relative Policy Optimization (GRPO) for language models, demonstrated by training a 1.5B-parameter Qwen model on the GSM8K grade-school math dataset. It targets researchers and engineers who want to understand and experiment with core RL mechanics without relying on complex external libraries. The key benefit is a simplified, modular codebase designed for learning, experimentation, and running scaled-down versions of complex RL research.

How It Works

The central design choice is to compute the GRPO loss directly in the codebase instead of delegating it to an external RL library, which keeps every step of the algorithm visible and easier to follow. The code is split into separate Python scripts: main.py orchestrates the training loop; llms.py handles model loading (currently LLaMA and Qwen, via Hugging Face Transformers); rldatasets.py loads and preprocesses the dataset (GSM8K); and evaluator.py implements reward functions and metrics that mirror DeepSeek's original setup.
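To make "computing the GRPO loss directly" concrete, here is a minimal, framework-free sketch of the objective as it is commonly described: rewards are normalized within a group of completions for the same prompt, and each token contributes a clipped policy-ratio term minus a KL penalty against a reference model. This is not taken from the repository. The function names, the dictionary layout, the default `clip_eps` and `beta` values, and the use of a population standard deviation are all assumptions made for illustration, and a real implementation would operate on tensors, not Python lists.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_objective(logp_new, logp_old, logp_ref, advantage, clip_eps, beta):
    """Clipped surrogate minus a KL penalty for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Non-negative estimate of the KL divergence from the reference policy.
    log_diff = logp_ref - logp_new
    kl = math.exp(log_diff) - log_diff - 1.0
    return surrogate - beta * kl


def grpo_loss(group, clip_eps=0.2, beta=0.04):
    """Loss for one prompt.

    `group` is a list of dicts (hypothetical layout) with a scalar "reward"
    and per-token log-probabilities "logp_new", "logp_old", "logp_ref".
    """
    rewards = [sample["reward"] for sample in group]
    advantages = group_relative_advantages(rewards)
    per_sample = []
    for sample, adv in zip(group, advantages):
        tokens = zip(sample["logp_new"], sample["logp_old"], sample["logp_ref"])
        objectives = [
            token_objective(n, o, r, adv, clip_eps, beta) for n, o, r in tokens
        ]
        per_sample.append(mean(objectives))  # average over tokens
    return -mean(per_sample)  # average over completions, negated to minimize
```

Because advantages are centered within the group, completions that score above the group mean are reinforced and those below it are penalized, without a separate value network.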

Quick Start & Requirements

  • Primary install command: pip install -r requirements.txt
  • Prerequisites: a Hugging Face token, supplied either through an environment variable (export HUGGINGFACE_TOKEN="your-token-here") or by running huggingface-cli login.
  • Hardware: Training was conducted on a single NVIDIA H100 GPU.
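
As a small illustration of the token prerequisite above, the following sketch shows one way a script could read the HUGGINGFACE_TOKEN variable before loading a model. How the repository actually consumes the variable is not described here, so the helper name `resolve_hf_token` and its fallback behavior are assumptions.

```python
import os


def resolve_hf_token(environ=None):
    """Return the token from HUGGINGFACE_TOKEN, or None if it is unset or blank.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    Returning None leaves room for credentials cached by `huggingface-cli login`.
    """
    env = os.environ if environ is None else environ
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("HUGGINGFACE_TOKEN is not set; relying on huggingface-cli login instead.")
    else:
        print("Using the token from HUGGINGFACE_TOKEN.")
```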

Highlighted Details

  • Direct GRPO loss calculation for educational clarity.
  • Modular, multi-script design tailored for learning and experimentation.
  • Demonstrates training a 1.5B-parameter Qwen model on the GSM8K dataset.
  • Codebase restructured to be easier to follow than the original implementations.
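
For the GSM8K setup mentioned above, rule-based rewards of the kind evaluator.py is described as implementing might look like the sketch below. GSM8K reference solutions end with a line of the form "#### <final answer>", which the sketch relies on. Everything else is an assumption for illustration: the `<think>` and `<answer>` tag names, the reward magnitudes, and the function names are hypothetical and not taken from the repository.

```python
import re


def extract_gsm8k_reference(solution: str) -> str:
    """GSM8K solutions end with '#### <final answer>'; keep only that part."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_model_answer(completion: str):
    """Pull the text inside <answer>...</answer> (tag name is an assumption)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip().replace(",", "")


def correctness_reward(completion: str, solution: str, reward: float = 2.0) -> float:
    """Full reward when the extracted number equals the reference, else 0."""
    predicted = extract_model_answer(completion)
    if predicted is None:
        return 0.0
    try:
        is_correct = float(predicted) == float(extract_gsm8k_reference(solution))
    except ValueError:  # non-numeric answer or malformed reference
        return 0.0
    return reward if is_correct else 0.0


def format_reward(completion: str, reward: float = 0.5) -> float:
    """Small bonus when the completion follows the expected tag layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return reward if re.fullmatch(pattern, completion, flags=re.DOTALL) else 0.0
```

Summing a correctness term and a format term per completion yields the scalar reward that GRPO then normalizes within each group of samples.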

Maintenance & Community

The README does not specify maintainers, community channels (like Discord or Slack), or a public roadmap.

Licensing & Compatibility

The provided README does not state the license under which this project is distributed.

Limitations & Caveats

The current implementation is focused on smaller-scale learning and experimentation. Future directions such as self-play, soft reward structures, or vision-language models would require faster execution and multi-GPU training support, which suggests these capabilities are not yet available.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 1 star in the last 30 days
