One-Shot-RLVR by ypwang61

RL fine-tuning with one training example

created 3 months ago
330 stars

Top 84.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the official implementation of "Reinforcement Learning for Reasoning in Large Language Models with One Training Example" (One-Shot RLVR). It enables efficient RL fine-tuning of Large Language Models (LLMs) for complex reasoning tasks using minimal data, targeting researchers and practitioners who want to improve LLM performance on tasks such as mathematical reasoning with limited supervision.

How It Works

The project uses Reinforcement Learning with Verifiable Rewards (RLVR) to enhance LLM reasoning in an extreme low-data regime. It adapts existing frameworks, verl and rllm (DeepScaleR), for training, and uses a modified version of the Qwen2.5-Math evaluation suite. The core idea is to RL-fine-tune an LLM on a single, carefully selected training example with a verifiable (answer-matching) reward, drastically reducing data requirements while maintaining or improving performance on reasoning benchmarks.
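
To make the mechanism concrete, here is a minimal sketch of a verifiable reward and a GRPO-style within-group advantage in the one-example regime. The answer-extraction rule, group size, and normalization constant are illustrative assumptions, not the repository's exact implementation:

```python
from statistics import mean, pstdev

def verifiable_reward(completion: str, gold_answer: str) -> float:
    r"""Binary verifiable reward: 1.0 iff the final \boxed{...} answer matches."""
    start = completion.rfind(r"\boxed{")
    if start == -1:
        return 0.0
    open_idx = start + len(r"\boxed{")
    close_idx = completion.find("}", open_idx)
    if close_idx == -1:
        return 0.0
    return 1.0 if completion[open_idx:close_idx].strip() == gold_answer.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# With a single training example, every rollout group samples the same prompt.
rollouts = [r"... so the answer is \boxed{12}", r"... giving \boxed{7}",
            r"... hence \boxed{12}", r"... therefore \boxed{3}"]
rewards = [verifiable_reward(c, "12") for c in rollouts]
print(group_advantages(rewards))  # correct rollouts get positive advantage
```

Because the reward needs only the ground-truth answer, not a demonstration, even a single prompt can produce a useful learning signal across many sampled rollouts.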

Quick Start & Requirements

  • Training Environment:
    • conda create -y -n rlvr_train python=3.10
    • conda activate rlvr_train
    • pip install -e .
    • pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
    • pip install ray vllm==0.6.3 flash-attn --no-build-isolation wandb matplotlib huggingface_hub
  • Evaluation Environment:
    • conda create -y -n rlvr_eval python=3.10
    • conda activate rlvr_eval
    • cd Qwen2.5-Eval/evaluation
    • pip install -e . (for latex2sympy)
    • pip install -r requirements.txt vllm==0.5.1 transformers==4.42.3 wandb matplotlib
  • Prerequisites: Python 3.10; PyTorch 2.4.0 with CUDA 12.1 support; vllm (0.6.3 for training, 0.5.1 for evaluation); flash-attn, wandb, matplotlib, huggingface_hub.
  • Resources: Training requires significant GPU resources. Checkpoints and datasets are available on Hugging Face (see the download example after this list).
  • Links: Paper, Models/Dataset, WandB Logs
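
The released checkpoints and datasets live on the Hugging Face Hub. A hypothetical download-and-load snippet is below; the repo id is a placeholder, so substitute the actual ids linked from the README:

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- use the checkpoint ids linked from the README.
local_dir = snapshot_download(repo_id="ypwang61/one-shot-rlvr-checkpoint")
model = AutoModelForCausalLM.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)
```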

Highlighted Details

  • Achieves competitive performance on math reasoning benchmarks with only one training example.
  • Supports multiple base models including Qwen2.5-Math (1.5B, 7B) and DeepSeek-R1-Distill-Qwen-1.5B.
  • Provides detailed evaluation results and reproducible scripts for various benchmarks (MATH500, AIME, AMC, etc.).
  • Includes a data selection strategy based on a historical variance score, which ranks training examples by the variance of their accuracy over a prior training run (see the sketch below).
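
A minimal sketch of that selection idea, assuming the score is simply the variance of each example's per-step accuracy logged during an earlier training run (the logging format and example names here are illustrative):

```python
from statistics import pvariance

def historical_variance_score(accuracy_history: list[float]) -> float:
    """Variance of an example's accuracy across logged training steps."""
    return pvariance(accuracy_history)

# accuracy at each logged step for three candidate examples (illustrative data)
histories = {
    "pi_1": [0.0, 0.2, 0.6, 0.9],  # learned gradually -> high variance
    "pi_2": [0.0, 0.0, 0.1, 0.1],  # barely learned    -> low variance
    "pi_3": [0.9, 0.9, 1.0, 1.0],  # already solved    -> low variance
}
best = max(histories, key=lambda name: historical_variance_score(histories[name]))
print(best)  # "pi_1" is selected as the single training example
```

Intuitively, high-variance examples are those the model is actively learning from, making them strong candidates when only one example can be used.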

Maintenance & Community

The project is associated with multiple researchers from Microsoft. Updates are posted on X (Twitter).

Licensing & Compatibility

This summary does not confirm the repository's license; check the LICENSE file on GitHub directly. The underlying frameworks (verl, rllm, Qwen2.5-Math) and the base models carry their own licenses, so commercial-use compatibility depends on those terms.

Limitations & Caveats

The setup pins specific versions of PyTorch and CUDA, and the training and evaluation environments require different vllm versions (0.6.3 vs. 0.5.1), hence the two separate conda environments. Evaluation scripts are adapted from other projects and may need careful configuration. The project is recent, and its long-term maintenance status is not yet established.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 3
  • Star History: 240 stars in the last 90 days
