RL recipe for reasoning ability in models
This repository provides a straightforward reinforcement learning (RL) framework for enhancing the reasoning capabilities of large language models (LLMs). It targets researchers and developers aiming to improve LLM performance on mathematical and logical reasoning tasks with minimal data and computational overhead. The key benefit is achieving significant accuracy gains across diverse models using a simple, rule-based reward system.
How It Works
The project employs a "zero RL" approach: it fine-tunes base LLMs directly with reinforcement learning, without prior supervised fine-tuning (SFT). Rewards come from a simple rule-based mechanism tailored to reasoning benchmarks such as GSM8K and MATH, which sidesteps learned reward modeling entirely and keeps training efficient and accessible. The framework is built upon Verl and leverages GRPO, with Ray and vLLM for accelerated training and inference.
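The idea of a rule-based reward combined with GRPO-style group normalization can be sketched as follows. The function names, the assumption that answers appear in a final \boxed{...}, and the exact matching rule are illustrative, not the repository's actual implementation.

```python
import re
import statistics
from typing import List, Optional

def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a model response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the gold answer."""
    predicted = extract_boxed_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == gold_answer.strip() else 0.0

def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """GRPO-style advantages: rewards for a group of samples drawn from the
    same prompt are mean-subtracted and scaled by the group's std, so no
    learned value model is needed."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + 1e-6) for r in group_rewards]
```

In practice, a string-match reward is usually paired with answer normalization (stripping whitespace, canonicalizing fractions), but the binary exact-match rule above captures the core simplification the project relies on.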
Quick Start & Requirements
Set up a verl conda environment with Python 3.9 and PyTorch 2.4.0 built for CUDA 12.4, then install the package with pip3 install -e . Flash-attn is also required. The training data (train.parquet and test.parquet) is downloaded from Hugging Face.
Highlighted Details
Maintenance & Community
The project is associated with HKUST NLP. Key components are built upon Verl, vLLM, and Qwen2.5-Math. Links to a Notion blog and Hugging Face collection are provided.
Licensing & Compatibility
The repository itself does not explicitly state a license in the README. However, it relies on Verl, which is Apache 2.0 licensed. Compatibility for commercial use would depend on the licenses of the base models used and the underlying frameworks.
Limitations & Caveats
The training process is resource-intensive, requiring multiple high-end GPUs (H100s are recommended for larger models). The README notes that increased response length does not necessarily correlate with specific cognitive behaviors such as self-verification. The project is built against recent models and may require updates for compatibility with newer LLM architectures or libraries.