RL technique for unlabeled data, especially test data
TTRL (Test-Time Reinforcement Learning) addresses the challenge of improving Large Language Model (LLM) performance on reasoning tasks using unlabeled test data. It enables online reinforcement learning by deriving reward signals from inference-time data, making it suitable for scenarios where ground-truth labels are unavailable. The target audience includes researchers and practitioners working with LLMs who need to enhance model capabilities without relying on labeled datasets.
How It Works
TTRL derives reward signals for reinforcement learning from common test-time scaling (TTS) techniques such as majority voting: the answer that most sampled outputs agree on serves as a pseudo-label, and rollouts are rewarded according to whether they match it. This bypasses the need for explicit ground-truth labels, allowing RL training to proceed on unlabeled inference data, and lets LLMs adapt and improve in real-world scenarios where labeled data is scarce or nonexistent, using readily available inference outputs.
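To make the reward derivation concrete, below is a minimal Python sketch of a majority-vote reward; the function name and the binary reward scheme are illustrative assumptions, not TTRL's exact implementation.

from collections import Counter

def majority_vote_reward(sampled_answers):
    """Derive pseudo-rewards from sampled answers without ground-truth labels.

    The most frequent answer across samples is treated as the pseudo-label;
    each sample is rewarded 1.0 if it matches that answer, else 0.0.
    (Hypothetical sketch, not TTRL's actual API.)
    """
    # Majority voting: the consensus answer acts as a proxy for the label.
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]

# Example: 8 rollouts for one test question; "42" wins the vote.
rewards = majority_vote_reward(["42", "41", "42", "42", "7", "42", "41", "42"])
# rewards == [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]

These per-rollout rewards can then feed a standard RL objective in place of label-based rewards.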
Quick Start & Requirements
Install the dependencies, then the package itself:

pip install -r requirements.txt
pip install -e .

requirements.txt lists the required dependencies; a wandb_key is needed for logging.
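As a minimal sketch, the key can be supplied through the standard Weights & Biases Python client; whether TTRL reads it from the environment or from a config file is an assumption here.

# Hypothetical sketch: supplying the wandb key for logging.
# Assumes the standard wandb client; TTRL's exact config mechanism may differ.
import os
import wandb

os.environ["WANDB_API_KEY"] = "<your-wandb-key>"  # placeholder, not a real key
wandb.login(key=os.environ["WANDB_API_KEY"])      # explicit login via the client API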
Highlighted Details
Maintenance & Community
Last activity: 3 weeks ago. Status: Inactive.
Licensing & Compatibility
Limitations & Caveats
The code is a preview release still undergoing optimization. Training on the AIME 2024 dataset exhibited instability, requiring additional runs for validation. No license is specified, so suitability for commercial use or linking into closed-source projects is unclear.