limit-of-RLVR  by LeapLabTHU

Investigating RLVR's impact on LLM reasoning

Created 8 months ago
314 stars

Top 86.1% on SourcePulse

View on GitHub
Project Summary

This repository provides code for a paper investigating whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely expands Large Language Model (LLM) reasoning or merely optimizes sampling of existing capabilities. It targets AI researchers and engineers, offering empirical evidence to guide RLVR application by distinguishing its effect on reasoning boundaries from its effect on sampling efficiency.

How It Works

The project evaluates RL-trained LLMs against their base models using the pass@k metric on mathematical and coding benchmarks. The analysis shows that RLVR improves sampling efficiency at small k, but that base models catch up with and surpass RL-trained ones at larger k, suggesting RLVR may narrow rather than expand reasoning capacity. Experiments use vLLM, managing random seeds to obtain diverse responses across repeated samples.
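The pass@k comparison described above is typically computed with the standard unbiased estimator from Chen et al. (2021); a minimal sketch (the repository's own evaluation scripts may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from n
    generations of which c are correct, is correct."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must hit a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 generations of which c=1 is correct, `pass_at_k(4, 1, 1)` gives 0.25 and `pass_at_k(4, 1, 4)` gives 1.0.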

Quick Start & Requirements

Evaluation code for DeepCoder and Math tasks is released. Specific installation or execution commands are not detailed, but prerequisites likely include a Python environment, vLLM, and potentially specific LLM checkpoints (e.g., DAPO, Oat-Zero). The primary reference is the arXiv paper: https://arxiv.org/abs/2504.13837.

Highlighted Details

  • Empirical evidence suggests RLVR boosts sampling efficiency but reduces LLMs' reasoning capacity boundary.
  • Base models consistently catch up with and surpass their RL-trained counterparts in pass@k evaluations as k grows.
  • RLVR algorithms perform similarly, remain far from optimal, and are fundamentally different from distillation.
  • Evaluation code for DeepCoder, Math, and other benchmarks is available.
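The crossover behind the first two bullets can be illustrated with a toy model (the numbers below are entirely synthetic, not from the paper): a model with higher per-sample accuracy can still lose at large k if it can solve a narrower set of problems.

```python
def avg_pass_at_k(coverage: float, per_sample_p: float, k: int) -> float:
    """Expected pass@k under a toy model where a fraction `coverage` of
    problems is solvable at all, each with independent per-sample
    success probability `per_sample_p`."""
    return coverage * (1.0 - (1.0 - per_sample_p) ** k)

# Hypothetical numbers: the RL-trained model is more accurate per sample
# (0.5 vs 0.1) but covers fewer solvable problems (60% vs 80%).
for k in (1, 8, 64, 256):
    base = avg_pass_at_k(coverage=0.8, per_sample_p=0.1, k=k)
    rl = avg_pass_at_k(coverage=0.6, per_sample_p=0.5, k=k)
    print(f"k={k:3d}  base={base:.3f}  rl={rl:.3f}")
```

At k=1 the RL model wins (0.30 vs 0.08), but as k grows each curve saturates at its coverage, so the base model overtakes it, which mirrors the paper's qualitative finding.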

Maintenance & Community

The project originates from Tsinghua University and Shanghai Jiao Tong University. No specific community channels (Discord/Slack) or roadmap links are mentioned in the provided text.

Licensing & Compatibility

The README does not explicitly state a software license. Until one is clarified, this may pose concerns for commercial use or integration into closed-source projects.

Limitations & Caveats

The core finding indicates RLVR may not expand LLMs' fundamental reasoning capabilities and could potentially limit them, suggesting a re-evaluation of its application for advancing core reasoning skills. No specific unsupported platforms or known bugs are detailed.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 32 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 3 more.

ROLL by alibaba

Top 2.3% on SourcePulse · 3k stars
RL library for large language models
Created 7 months ago · Updated 23 hours ago