TTRL by PRIME-RL

RL technique for unlabeled data, especially test data

  • Created 3 months ago
  • 737 stars
  • Top 48.0% on sourcepulse

Project Summary

TTRL (Test-Time Reinforcement Learning) addresses the challenge of improving Large Language Model (LLM) performance on reasoning tasks using unlabeled test data. It enables online reinforcement learning by deriving reward signals from inference-time data, making it suitable for scenarios where ground-truth labels are unavailable. The target audience includes researchers and practitioners working with LLMs who need to enhance model capabilities without relying on labeled datasets.

How It Works

TTRL leverages a novel approach where reward signals for reinforcement learning are derived from common test-time scaling (TTS) techniques, such as majority voting. This method bypasses the need for explicit ground-truth labels, allowing RL training to proceed on unlabeled inference data. The advantage lies in its ability to adapt and improve LLMs in real-world scenarios where labeled data is scarce or non-existent, using readily available inference outputs.
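The majority-voting idea above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the most common answer among sampled rollouts for a prompt is treated as a pseudo-label, and each rollout is rewarded for agreeing with it (the function name and binary reward scheme here are illustrative assumptions).

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Derive pseudo-rewards from majority voting over rollouts.

    The most frequent answer among the sampled rollouts is taken as a
    pseudo ground-truth label; each rollout gets reward 1.0 if its
    answer matches that label, else 0.0. No real labels are needed.
    """
    pseudo_label, _count = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]

# Four rollouts for one prompt: "42" wins the vote, so the dissenting
# rollout ("41") receives zero reward.
rewards = majority_vote_reward(["42", "41", "42", "42"])
# → [1.0, 0.0, 1.0, 1.0]
```

These per-rollout rewards can then be fed to a standard RL objective (e.g., PPO/GRPO-style updates) in place of label-based rewards.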

Quick Start & Requirements

  • Install: pip install -r requirements.txt, pip install -e .
  • Prerequisites: Python, the dependencies in requirements.txt, and a wandb_key for logging.
  • Hardware: 8 x NVIDIA A100 40GB GPUs were used for experiments.
  • Resources: Requires cloning the repository and installing dependencies.
  • Links: Paper, Github, Wandb Logs

Highlighted Details

  • Achieved a 159% boost in pass@1 performance for Qwen-2.5-Math-7B on AIME 2024 using unlabeled data.
  • Consistently surpasses the initial model's performance and approaches supervised training levels.
  • TTRL is implemented as a modification of the reward function, so it can be adopted and adapted to new settings quickly.
  • Code is a preview version based on OpenRLHF, with planned integration into official OpenRLHF and verl.

Maintenance & Community

  • The project is actively developed, with code and logs released on April 24, 2025.
  • Contact information for Kaiyan Zhang and Ning Ding is provided.
  • Future integration with OpenRLHF and verl is planned.

Licensing & Compatibility

  • The README does not explicitly state a license. The project is hosted on GitHub, implying a potential open-source license, but specific terms are not detailed.

Limitations & Caveats

The current code is a preview version and is still undergoing optimization. The AIME 2024 dataset exhibited instability, necessitating additional runs for validation. The specific license for commercial use or closed-source linking is not specified.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 351 stars in the last 90 days
