RL fine-tuning with one training example
This repository provides an official implementation for "Reinforcement Learning for Reasoning in Large Language Models with One Training Example" (One-Shot RLVR). It enables efficient fine-tuning of Large Language Models (LLMs) for complex reasoning tasks using minimal data, targeting researchers and practitioners who want to improve LLM performance on reasoning benchmarks, such as mathematical problem solving, with limited supervision.
How It Works
The project leverages Reinforcement Learning (RL) to enhance LLM reasoning capabilities, focusing on a "one-shot" training paradigm. It adapts existing frameworks, verl and rllm (DeepScaleR), for training, and uses a modified version of the Qwen2.5-Math evaluation pipeline. The core idea is to run RL with verifiable rewards (RLVR) on a single training example, significantly reducing data requirements while maintaining or improving performance on reasoning benchmarks.
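The "verifiable reward" at the heart of RLVR can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the repository's actual verifier: it extracts a `\boxed{...}` answer and awards a binary reward via exact string match, whereas the real evaluation uses a fuller math-equivalence checker.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final boxed answer matches the ground truth.

    Extracting `\\boxed{...}` and comparing strings exactly are simplifying
    assumptions; a production verifier would normalize mathematical forms.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", model_output)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# With one training example, every RL rollout is scored against the
# same ground-truth answer:
rollout = "The sum is 3 + 4 = 7, so the answer is \\boxed{7}."
print(verifiable_reward(rollout, "7"))  # 1.0
```

Because the reward depends only on the final answer, the policy is free to explore different reasoning chains for the single example, which is what makes the one-shot setup viable.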
Quick Start & Requirements
Training environment:
conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3 flash-attn --no-build-isolation wandb matplotlib huggingface_hub
Evaluation environment:
conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt vllm==0.5.1 transformers==4.42.3 wandb matplotlib
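After installation, a quick sanity check can confirm which required packages resolve in the active environment. This helper is a hypothetical convenience, not part of the repository; the package names are taken from the install commands above.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Package names assumed from the training-environment install commands.
required = ["torch", "ray", "vllm", "wandb", "matplotlib", "huggingface_hub"]
print(missing_packages(required))  # empty list when the training env is complete
```

Running this inside `rlvr_train` versus `rlvr_eval` also helps catch the common mistake of installing into the wrong conda environment, since the two envs pin different vllm versions.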
Key dependencies: vllm, flash-attn, wandb, matplotlib, huggingface_hub.
Highlighted Details
Maintenance & Community
The project is associated with multiple researchers from Microsoft. Updates are posted on X (Twitter).
Licensing & Compatibility
The repository itself appears to be MIT-licensed, though this is not confirmed here, and the underlying frameworks (verl, rllm, Qwen2.5-Math) may have different licenses. Commercial use compatibility depends on the licenses of these base models and frameworks.
Limitations & Caveats
The setup requires specific versions of PyTorch and CUDA, and involves multiple complex dependencies. Evaluation scripts are adapted from other projects, potentially requiring careful configuration. The project is recent, and long-term maintenance status is not yet established.