TTRL by PRIME-RL

RL technique for unlabeled data, especially test data

Created 4 months ago
809 stars

Top 43.7% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

TTRL (Test-Time Reinforcement Learning) addresses the challenge of improving Large Language Model (LLM) performance on reasoning tasks using unlabeled test data. It enables online reinforcement learning by deriving reward signals from inference-time data, making it suitable for scenarios where ground-truth labels are unavailable. The target audience includes researchers and practitioners working with LLMs who need to enhance model capabilities without relying on labeled datasets.

How It Works

TTRL derives reward signals for reinforcement learning from common test-time scaling (TTS) techniques such as majority voting: the model samples multiple responses per prompt, the consensus answer is treated as a pseudo ground-truth label, and each response is rewarded for agreeing with it. This bypasses the need for explicit ground-truth labels, allowing RL training to proceed on unlabeled inference data. The advantage lies in its ability to adapt and improve LLMs in real-world scenarios where labeled data is scarce or nonexistent, using readily available inference outputs.
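
This core reward rule can be sketched in a few lines of Python. This is a minimal illustration based on the description above, not the repository's actual implementation; the function name and the binary 0/1 reward are assumptions:

    from collections import Counter

    def majority_vote_reward(sampled_answers):
        # Sketch of TTRL's reward idea: the most frequent answer among the
        # rollouts is treated as a pseudo ground-truth label, and each
        # rollout is rewarded for agreeing with that consensus.
        pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
        rewards = [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]
        return pseudo_label, rewards

    # Example: answers parsed from five rollouts of one math problem.
    label, rewards = majority_vote_reward(["42", "42", "7", "42", "13"])
    print(label, rewards)  # -> 42 [1.0, 1.0, 0.0, 1.0, 0.0]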

Quick Start & Requirements

  • Install: pip install -r requirements.txt, then pip install -e .
  • Prerequisites: Python, the dependencies in requirements.txt, and a wandb_key for logging.
  • Hardware: experiments were run on 8 x NVIDIA A100 40GB GPUs.
  • Resources: clone the repository and install the dependencies before running.
  • Links: Paper, GitHub, Wandb Logs

Highlighted Details

  • Achieved a 159% boost in pass@1 performance for Qwen-2.5-Math-7B on AIME 2024 using only unlabeled test data.
  • Consistently surpasses the initial model's performance and approaches supervised training levels.
  • Because TTRL only modifies the reward function, it can be implemented rapidly on top of existing RL pipelines and adapted to new tasks (see the sketch after this list).
  • Code is a preview version based on OpenRLHF, with planned integration into official OpenRLHF and verl.
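
Because the method only changes where rewards come from, plugging it into an existing RL trainer is largely a matter of swapping the reward call. A hypothetical sketch of such a hook follows; the name and signature are illustrative and do not correspond to OpenRLHF's actual API:

    from collections import Counter

    def ttrl_reward_hook(grouped_answers):
        # Hypothetical trainer hook: grouped_answers holds, for each prompt,
        # the final answers parsed from its N rollouts. Returns a flat reward
        # list aligned with the rollouts, consumed by the RL trainer in place
        # of label-based rewards.
        rewards = []
        for answers in grouped_answers:
            label, _ = Counter(answers).most_common(1)[0]
            rewards.extend(1.0 if a == label else 0.0 for a in answers)
        return rewards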

Maintenance & Community

  • The project is actively developed, with code and logs released on April 24, 2025.
  • Contact information for Kaiyan Zhang and Ning Ding is provided.
  • Future integration with OpenRLHF and verl is planned.

Licensing & Compatibility

  • The README does not explicitly state a license, and no specific terms are detailed, so usage and redistribution rights are unclear.

Limitations & Caveats

The current code is a preview version that is still being optimized. Training on AIME 2024 exhibited instability, requiring additional runs for validation. Because no license is specified, the terms for commercial use or closed-source linking are unclear.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 50 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

simpleRL-reason by hkust-nlp

Top 0.1% on SourcePulse
4k stars
RL recipe for reasoning ability in models
Created 7 months ago
Updated 1 month ago
Starred by Michael Han (Cofounder of Unsloth), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 19 more.

DeepSeek-R1 by deepseek-ai

Top 0.1% on SourcePulse
91k stars
Reasoning models research paper
Created 8 months ago
Updated 2 months ago