Label-Free-RLVR by QingyangZhang

Curated papers on label-free reinforcement learning for LLMs

Created 7 months ago

298 stars

Top 89.2% on SourcePulse

Project Summary

This repository curates research papers on label-free Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). It is aimed at researchers and practitioners exploring ways to train and improve LLMs without external, human-annotated reward signals, with a focus on self-supervision and intrinsic motivation.

How It Works

The collection highlights papers on label-free RLVR techniques, including self-play, entropy minimization, confidence maximization, and surrogate signals derived from output format or length. In each case the model generates its own training signal, which reduces reliance on costly external supervision while it refines its reasoning.
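To make the idea of a self-generated training signal concrete, here is a minimal Python sketch of two such signals: a confidence-style reward based on the entropy of sampled final answers, and a format-based surrogate reward. It is an illustration only and is not taken from any specific paper in the list. The function names and the `\boxed{...}` answer convention are assumptions made for this example.

```python
import math
import re
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled final answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def confidence_reward(answers):
    """Hypothetical label-free reward: negative entropy, highest when all samples agree."""
    return -answer_entropy(answers)


def format_reward(completion):
    """Hypothetical surrogate reward: 1.0 if the completion contains a \\boxed{...} answer.

    The \\boxed convention is an assumption for this sketch, not a requirement
    of any particular method.
    """
    return 1.0 if re.search(r"\\boxed\{[^{}]+\}", completion) else 0.0


# Four sampled answers to the same prompt; no ground-truth label is used.
samples_agree = ["4", "4", "4", "4"]
samples_split = ["4", "4", "5", "5"]
print(confidence_reward(samples_agree) > confidence_reward(samples_split))
```

Neither function looks at a reference answer, which is the defining property of the label-free setting. In a real training loop such a score would be fed to a policy-gradient update in place of a verifier's reward; that step is omitted here.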

Highlighted Details

Covers a broad spectrum of RLVR strategies, from fully unsupervised methods to those requiring limited data or single examples.
Includes papers addressing potential pitfalls and evaluation challenges in RLVR research.
Features recent advancements in self-rewarding mechanisms and intrinsic motivation for LLM reasoning.
Organized into categories such as "RLVR without External Supervision" and "RLVR with Limited Data".

Maintenance & Community

This curated list of papers is maintained by Qingyang Zhang, Haitao Wu, and Yi Ding. Contributions and suggestions for missing papers are welcome through the repository's issue tracker.

Licensing & Compatibility

The repository itself does not contain code or models; it is a collection of links to research papers. The licensing of individual papers is determined by their respective publishers or preprint servers (e.g., arXiv).

Limitations & Caveats

This repository is a literature collection and does not provide implementation code, datasets, or runnable examples. Users must refer to the individual papers for technical details, prerequisites, and experimental setups.

Health Check

Last Commit

6 months ago

Responsiveness

Inactive

Star History

7 stars in the last 30 days