RL fine-tuning with one training example
This repository provides an official implementation for "Reinforcement Learning for Reasoning in Large Language Models with One Training Example" (One-Shot RLVR). It enables efficient fine-tuning of Large Language Models (LLMs) for complex reasoning tasks using minimal data, targeting researchers and practitioners who want to improve LLM performance on reasoning benchmarks, such as mathematical problem solving, with limited supervision.
How It Works
The project leverages Reinforcement Learning (RL) to enhance LLM reasoning capabilities, focusing on a "one-shot" training paradigm. It adapts existing frameworks, verl and rllm (DeepScaleR), for training, and uses a modified version of the Qwen2.5-Math evaluation pipeline. The core idea is to run RL with verifiable rewards (RLVR) on a single training example, significantly reducing data requirements while maintaining or improving performance on reasoning benchmarks.
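The "verifiable reward" at the heart of RLVR can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the repository's actual verifier: it extracts a `\boxed{...}` answer and awards a binary reward via exact string match, whereas the real evaluation uses a fuller math-equivalence checker.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final boxed answer matches the ground truth.

    Extracting `\\boxed{...}` and comparing strings exactly are simplifying
    assumptions; a production verifier would normalize mathematical forms.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", model_output)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# With one training example, every RL rollout is scored against the
# same ground-truth answer:
rollout = "The sum is 3 + 4 = 7, so the answer is \\boxed{7}."
print(verifiable_reward(rollout, "7"))  # 1.0
```

Because the reward depends only on the final answer, the policy is free to explore different reasoning chains for the single example, which is what makes the one-shot setup viable.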
Quick Start & Requirements
Training environment:
conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3 flash-attn --no-build-isolation wandb matplotlib huggingface_hub
Evaluation environment:
conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt vllm==0.5.1 transformers==4.42.3 wandb matplotlib
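After installation, a quick sanity check can confirm which required packages resolve in the active environment. This helper is a hypothetical convenience, not part of the repository; the package names are taken from the install commands above.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Package names assumed from the training-environment install commands.
required = ["torch", "ray", "vllm", "wandb", "matplotlib", "huggingface_hub"]
print(missing_packages(required))  # empty list when the training env is complete
```

Running this inside `rlvr_train` versus `rlvr_eval` also helps catch the common mistake of installing into the wrong conda environment, since the two envs pin different vllm versions.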
Key dependencies: vllm, flash-attn, wandb, matplotlib, huggingface_hub.
Highlighted Details
Maintenance & Community
The project is associated with multiple researchers from Microsoft. Updates are posted on X (Twitter).
Licensing & Compatibility
The repository itself appears to be MIT-licensed, though this is not confirmed here, and the underlying frameworks (verl, rllm, Qwen2.5-Math) may have different licenses. Commercial use compatibility depends on the licenses of these base models and frameworks.
Limitations & Caveats
The setup requires specific versions of PyTorch and CUDA, and involves multiple complex dependencies. Evaluation scripts are adapted from other projects, potentially requiring careful configuration. The project is recent, and long-term maintenance status is not yet established.