limit-of-RLVR  by LeapLabTHU

Investigating RLVR's impact on LLM reasoning

Created 8 months ago
314 stars

Top 86.1% on SourcePulse

View on GitHub
Project Summary

This repository provides code for a paper investigating whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely expands Large Language Model (LLM) reasoning or merely optimizes sampling of existing capabilities. It targets AI researchers and engineers, offering empirical evidence to guide RLVR application by distinguishing its effect on reasoning boundaries from its effect on sampling efficiency.

How It Works

The project evaluates RL-trained LLMs against their base models using the pass@k metric on mathematical and coding benchmarks. The analysis shows that RLVR improves sampling efficiency at small k, but that base models catch up with and surpass RL-trained ones at larger k, suggesting RLVR may narrow rather than expand reasoning capacity. Experiments use vLLM, managing random seeds to obtain diverse responses across repeated samples.
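The pass@k comparison described above is typically computed with the standard unbiased estimator from Chen et al. (2021); a minimal sketch (the repository's own evaluation scripts may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from n
    generations of which c are correct, is correct."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must hit a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 generations of which c=1 is correct, `pass_at_k(4, 1, 1)` gives 0.25 and `pass_at_k(4, 1, 4)` gives 1.0.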

Quick Start & Requirements

Evaluation code for DeepCoder and Math tasks is released. Specific installation or execution commands are not detailed, but prerequisites likely include a Python environment, vLLM, and potentially specific LLM checkpoints (e.g., DAPO, Oat-Zero). The primary reference is the arXiv paper: https://arxiv.org/abs/2504.13837.

Highlighted Details

  • Empirical evidence suggests RLVR boosts sampling efficiency but reduces LLMs' reasoning capacity boundary.
  • Base models consistently catch up with and surpass their RL-trained counterparts in pass@k evaluations as k grows.
  • RLVR algorithms perform similarly, remain far from optimal, and are fundamentally different from distillation.
  • Evaluation code for DeepCoder, Math, and other benchmarks is available.
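The crossover behind the first two bullets can be illustrated with a toy model (the numbers below are entirely synthetic, not from the paper): a model with higher per-sample accuracy can still lose at large k if it can solve a narrower set of problems.

```python
def avg_pass_at_k(coverage: float, per_sample_p: float, k: int) -> float:
    """Expected pass@k under a toy model where a fraction `coverage` of
    problems is solvable at all, each with independent per-sample
    success probability `per_sample_p`."""
    return coverage * (1.0 - (1.0 - per_sample_p) ** k)

# Hypothetical numbers: the RL-trained model is more accurate per sample
# (0.5 vs 0.1) but covers fewer solvable problems (60% vs 80%).
for k in (1, 8, 64, 256):
    base = avg_pass_at_k(coverage=0.8, per_sample_p=0.1, k=k)
    rl = avg_pass_at_k(coverage=0.6, per_sample_p=0.5, k=k)
    print(f"k={k:3d}  base={base:.3f}  rl={rl:.3f}")
```

At k=1 the RL model wins (0.30 vs 0.08), but as k grows each curve saturates at its coverage, so the base model overtakes it, which mirrors the paper's qualitative finding.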

Maintenance & Community

The project originates from Tsinghua University and Shanghai Jiao Tong University. No specific community channels (Discord/Slack) or roadmap links are mentioned in the provided text.

Licensing & Compatibility

The README does not explicitly state a software license. Until one is clarified, this may pose concerns for commercial use or integration into closed-source projects.

Limitations & Caveats

The core finding indicates RLVR may not expand LLMs' fundamental reasoning capabilities and could potentially limit them, suggesting a re-evaluation of its application for advancing core reasoning skills. No specific unsupported platforms or known bugs are detailed.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 32 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 3 more.

ROLL by alibaba

Top 2.3% on SourcePulse · 3k stars
RL library for large language models
Created 7 months ago · Updated 23 hours ago