simpleRL-reason by hkust-nlp

RL recipe for reasoning ability in models

created 6 months ago
3,697 stars

Top 13.4% on sourcepulse

Project Summary

This repository provides a straightforward reinforcement learning (RL) framework for enhancing the reasoning capabilities of large language models (LLMs). It targets researchers and developers aiming to improve LLM performance on mathematical and logical reasoning tasks with minimal data and computational overhead. The key benefit is achieving significant accuracy gains across diverse models using a simple, rule-based reward system.

How It Works

The project employs a "zero RL" approach: it applies RL directly to base LLMs without a prior supervised fine-tuning (SFT) stage. Rewards are rule-based and tailored to math reasoning benchmarks such as GSM8K and MATH, which avoids training a separate reward model and keeps the training pipeline simple and efficient. The framework is built on Verl and uses GRPO, with Ray for distributed training and vLLM for accelerated inference.
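
A minimal sketch of the idea, assuming a MATH-style \boxed{...} final-answer format; the function names and the exact matching rule below are illustrative, not the repository's implementation:

```python
import re
from typing import List, Optional

def extract_boxed_answer(response: str) -> Optional[str]:
    """Pull the final answer out of a \\boxed{...} span (MATH-style convention)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.
    No learned reward model is involved."""
    predicted = extract_boxed_answer(response)
    return 1.0 if predicted == gold_answer.strip() else 0.0

def grpo_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style group-relative advantages: mean-center and std-normalize the
    rewards of the responses sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Example: four sampled responses to one prompt whose reference answer is "42".
group = [
    "... so the answer is \\boxed{42}",
    "... therefore \\boxed{41}",
    "... \\boxed{42}",
    "no boxed answer here",
]
rewards = [rule_based_reward(r, "42") for r in group]
print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```

Because rewards are computed per group of rollouts and normalized relative to one another, GRPO dispenses with both a learned reward model and a value network.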

Quick Start & Requirements

  • Installation: pip3 install -e . inside a conda environment set up for verl (Python 3.9, PyTorch 2.4.0 built for CUDA 12.4); flash-attn is also required.
  • Prerequisites: Python 3.9, PyTorch 2.4.0 (CUDA 12.4), Flash Attention, Ray for distributed training.
  • Data: Requires downloading the training and evaluation splits (train.parquet and test.parquet) from Hugging Face; see the loading sketch after this list.
  • Hardware: Minimum 1x A100-80G for 0.5B models; 2x8 H100-80G for 7B/14B models (15 hours for 100 steps); 8x8 H100-80G for 32B models (1.5 days).
  • Links: Paper, Hugging Face Collection, Blog
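
The data ships as parquet files; a minimal loading sketch with pandas follows. The column names used here ("prompt", "answer") are assumptions for illustration, not the guaranteed schema, so inspect the columns first:

```python
import pandas as pd

# Load the splits downloaded from Hugging Face (file names from the README).
train_df = pd.read_parquet("train.parquet")
test_df = pd.read_parquet("test.parquet")

print(f"train examples: {len(train_df)}, test examples: {len(test_df)}")
print(train_df.columns.tolist())  # inspect the actual schema before assuming fields

# Hypothetical field access; adjust to the real column names printed above.
first = train_df.iloc[0]
if "prompt" in train_df.columns and "answer" in train_df.columns:
    print(first["prompt"], "->", first["answer"])
```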

Highlighted Details

  • Achieves accuracy gains of 10-20+ absolute points on GSM8K and MATH benchmarks.
  • Successfully trained 10 diverse models including Llama3, Mistral, DeepSeekMath, and Qwen series (0.5B to 32B).
  • Uses a small dataset (8K examples) for efficient fine-tuning.
  • Supports distributed training via Ray and accelerated inference with vLLM (see the sketch after this list).
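
A hedged sketch of generating a group of rollouts with vLLM; the model name and sampling settings are illustrative choices, not the repository's exact configuration:

```python
from vllm import LLM, SamplingParams

# Rollout generation with vLLM. The model name is illustrative; substitute
# whichever base or RL-tuned checkpoint you are training or evaluating.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")

# Sample several responses per prompt, since GRPO-style training needs a
# group of rollouts for each question.
sampling_params = SamplingParams(n=8, temperature=1.0, max_tokens=1024)

prompts = ["Question: What is 12 * 7? Put your final answer in \\boxed{}."]
outputs = llm.generate(prompts, sampling_params)

for request_output in outputs:
    for completion in request_output.outputs:
        print(completion.text)
```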

Maintenance & Community

The project is associated with HKUST NLP. Key components are built upon Verl, vLLM, and Qwen2.5-Math. Links to a Notion blog and Hugging Face collection are provided.

Licensing & Compatibility

The repository itself does not explicitly state a license in the README. However, it relies on Verl, which is Apache 2.0 licensed. Compatibility for commercial use would depend on the licenses of the base models used and the underlying frameworks.

Limitations & Caveats

The training process is resource-intensive, requiring multiple high-end GPUs (H100s are recommended for the larger models). The README notes that longer responses do not necessarily correspond to the emergence of specific cognitive behaviors such as self-verification. The project targets recent model families and may need updates to stay compatible with newer LLM architectures or libraries.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 204 stars in the last 90 days
