simpleRL-reason by hkust-nlp

RL recipe for reasoning ability in models

created 6 months ago
3,697 stars

Top 13.4% on sourcepulse

Project Summary

This repository provides a straightforward reinforcement learning (RL) framework for enhancing the reasoning capabilities of large language models (LLMs). It targets researchers and developers aiming to improve LLM performance on mathematical and logical reasoning tasks with minimal data and computational overhead. The key benefit is achieving significant accuracy gains across diverse models using a simple, rule-based reward system.

How It Works

The project employs a "zero RL" approach: it applies RL directly to base LLMs without a prior supervised fine-tuning (SFT) stage. Rewards are rule-based and tailored to math reasoning benchmarks such as GSM8K and MATH, which avoids training a separate reward model and keeps the training pipeline simple and efficient. The framework is built on Verl and uses GRPO, with Ray for distributed training and vLLM for accelerated inference.
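
A minimal sketch of the idea, assuming a MATH-style \boxed{...} final-answer format; the function names and the exact matching rule below are illustrative, not the repository's implementation:

```python
import re
from typing import List, Optional

def extract_boxed_answer(response: str) -> Optional[str]:
    """Pull the final answer out of a \\boxed{...} span (MATH-style convention)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.
    No learned reward model is involved."""
    predicted = extract_boxed_answer(response)
    return 1.0 if predicted == gold_answer.strip() else 0.0

def grpo_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style group-relative advantages: mean-center and std-normalize the
    rewards of the responses sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Example: four sampled responses to one prompt whose reference answer is "42".
group = [
    "... so the answer is \\boxed{42}",
    "... therefore \\boxed{41}",
    "... \\boxed{42}",
    "no boxed answer here",
]
rewards = [rule_based_reward(r, "42") for r in group]
print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```

Because rewards are computed per group of rollouts and normalized relative to one another, GRPO dispenses with both a learned reward model and a value network.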

Quick Start & Requirements

  • Installation: pip3 install -e . inside a conda environment set up for verl (Python 3.9, PyTorch 2.4.0 built for CUDA 12.4); flash-attn is also required.
  • Prerequisites: Python 3.9, PyTorch 2.4.0 (CUDA 12.4), Flash Attention, Ray for distributed training.
  • Data: Requires downloading the training and evaluation splits (train.parquet and test.parquet) from Hugging Face; see the loading sketch after this list.
  • Hardware: Minimum 1x A100-80G for 0.5B models; 2x8 H100-80G for 7B/14B models (15 hours for 100 steps); 8x8 H100-80G for 32B models (1.5 days).
  • Links: Paper, Hugging Face Collection, Blog
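
The data ships as parquet files; a minimal loading sketch with pandas follows. The column names used here ("prompt", "answer") are assumptions for illustration, not the guaranteed schema, so inspect the columns first:

```python
import pandas as pd

# Load the splits downloaded from Hugging Face (file names from the README).
train_df = pd.read_parquet("train.parquet")
test_df = pd.read_parquet("test.parquet")

print(f"train examples: {len(train_df)}, test examples: {len(test_df)}")
print(train_df.columns.tolist())  # inspect the actual schema before assuming fields

# Hypothetical field access; adjust to the real column names printed above.
first = train_df.iloc[0]
if "prompt" in train_df.columns and "answer" in train_df.columns:
    print(first["prompt"], "->", first["answer"])
```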

Highlighted Details

  • Achieves accuracy gains of 10-20+ absolute points on GSM8K and MATH benchmarks.
  • Successfully trained 10 diverse models including Llama3, Mistral, DeepSeekMath, and Qwen series (0.5B to 32B).
  • Uses a small dataset (8K examples) for efficient fine-tuning.
  • Supports distributed training via Ray and accelerated inference with vLLM (see the sketch after this list).
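
A hedged sketch of generating a group of rollouts with vLLM; the model name and sampling settings are illustrative choices, not the repository's exact configuration:

```python
from vllm import LLM, SamplingParams

# Rollout generation with vLLM. The model name is illustrative; substitute
# whichever base or RL-tuned checkpoint you are training or evaluating.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")

# Sample several responses per prompt, since GRPO-style training needs a
# group of rollouts for each question.
sampling_params = SamplingParams(n=8, temperature=1.0, max_tokens=1024)

prompts = ["Question: What is 12 * 7? Put your final answer in \\boxed{}."]
outputs = llm.generate(prompts, sampling_params)

for request_output in outputs:
    for completion in request_output.outputs:
        print(completion.text)
```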

Maintenance & Community

The project is associated with HKUST NLP. Key components are built upon Verl, vLLM, and Qwen2.5-Math. Links to a Notion blog and Hugging Face collection are provided.

Licensing & Compatibility

The repository itself does not explicitly state a license in the README. However, it relies on Verl, which is Apache 2.0 licensed. Compatibility for commercial use would depend on the licenses of the base models used and the underlying frameworks.

Limitations & Caveats

The training process is resource-intensive, requiring multiple high-end GPUs (H100s are recommended for the larger models). The README notes that longer responses do not necessarily correspond to the emergence of specific cognitive behaviors such as self-verification. The project targets recent model families and may need updates to stay compatible with newer LLM architectures or libraries.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 204 stars in the last 90 days
