RL technique for unlabeled data, especially test data
TTRL (Test-Time Reinforcement Learning) addresses the challenge of improving Large Language Model (LLM) performance on reasoning tasks using unlabeled test data. It enables online reinforcement learning by deriving reward signals from inference-time data, making it suitable for scenarios where ground-truth labels are unavailable. The target audience includes researchers and practitioners working with LLMs who need to enhance model capabilities without relying on labeled datasets.
How It Works
TTRL derives reward signals for reinforcement learning from common test-time scaling (TTS) techniques such as majority voting: the answer that most sampled outputs agree on serves as a pseudo-label, and rollouts are rewarded according to whether they match it. This bypasses the need for explicit ground-truth labels, allowing RL training to proceed on unlabeled inference data, and lets LLMs adapt and improve in real-world scenarios where labeled data is scarce or nonexistent, using readily available inference outputs.
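To make the reward derivation concrete, below is a minimal Python sketch of a majority-vote reward; the function name and the binary reward scheme are illustrative assumptions, not TTRL's exact implementation.

from collections import Counter

def majority_vote_reward(sampled_answers):
    """Derive pseudo-rewards from sampled answers without ground-truth labels.

    The most frequent answer across samples is treated as the pseudo-label;
    each sample is rewarded 1.0 if it matches that answer, else 0.0.
    (Hypothetical sketch, not TTRL's actual API.)
    """
    # Majority voting: the consensus answer acts as a proxy for the label.
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]

# Example: 8 rollouts for one test question; "42" wins the vote.
rewards = majority_vote_reward(["42", "41", "42", "42", "7", "42", "41", "42"])
# rewards == [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]

These per-rollout rewards can then feed a standard RL objective in place of label-based rewards.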
Quick Start & Requirements
Install the dependencies, then the package itself:

pip install -r requirements.txt
pip install -e .

requirements.txt lists the required dependencies; a wandb_key is needed for logging.
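As a minimal sketch, the key can be supplied through the standard Weights & Biases Python client; whether TTRL reads it from the environment or from a config file is an assumption here.

# Hypothetical sketch: supplying the wandb key for logging.
# Assumes the standard wandb client; TTRL's exact config mechanism may differ.
import os
import wandb

os.environ["WANDB_API_KEY"] = "<your-wandb-key>"  # placeholder, not a real key
wandb.login(key=os.environ["WANDB_API_KEY"])      # explicit login via the client API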
Highlighted Details
Maintenance & Community
Last activity: 3 weeks ago. Status: Inactive.
Licensing & Compatibility
Limitations & Caveats
The code is a preview release still undergoing optimization. Training on the AIME 2024 dataset exhibited instability, requiring additional runs for validation. No license is specified, so suitability for commercial use or linking into closed-source projects is unclear.