PRIME by PRIME-RL

Scalable RL solution for advanced reasoning of language models

created 7 months ago
1,668 stars

Top 25.9% on sourcepulse

View on GitHub
Project Summary

PRIME is an open-source solution for enhancing large language model (LLM) reasoning capabilities through reinforcement learning (RL) with implicit process rewards. It targets researchers and developers aiming to improve LLM performance on complex tasks like math and coding, offering a scalable alternative to imitation learning by providing dense, online-updatable reward signals.

How It Works

PRIME leverages an "Implicit Process Reward Model" (Implicit PRM) trained as an outcome reward model (ORM). This approach avoids the need for explicit process labels, instead learning a Q-function that provides token-level rewards. The Implicit PRM is updated online with outcome verifiers, mitigating distribution shift and scalability issues. PRIME integrates this into an RL framework, where both the policy model and PRM are initialized from a Supervised Fine-Tuned (SFT) model. During RL iterations, rollouts are generated, scored by the PRM and an outcome verifier, and the PRM is updated. Combined outcome and process rewards then update the policy model, often using PPO.
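In this setup, the token-level process reward can be viewed as a log-probability ratio between the implicit PRM and a reference model, with the verifier outcome attached as a terminal reward. The PyTorch sketch below is only an illustration of that idea under stated assumptions: the function names, the `beta` coefficient, the tensor shapes, and the way the outcome reward is folded into the final token are assumptions for this sketch, not PRIME's exact implementation.

```python
import torch
import torch.nn.functional as F


def implicit_process_rewards(prm_logits, ref_logits, response_ids, beta=0.05):
    """Token-level implicit process rewards as a log-probability ratio.

    Sketch of the implicit-PRM idea: the PRM is trained purely as an outcome
    reward model, and dense per-token rewards fall out as
    beta * log(pi_prm(y_t | y_<t) / pi_ref(y_t | y_<t)).
    `beta` and the shapes are illustrative assumptions.

    prm_logits, ref_logits: [seq_len, vocab_size] logits over the response tokens
    response_ids:           [seq_len] sampled response token ids
    """
    prm_logp = F.log_softmax(prm_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Log-probability of each sampled token under the PRM and the reference model.
    token_prm = prm_logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    token_ref = ref_logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    return beta * (token_prm - token_ref)  # [seq_len] dense process rewards


def combined_rewards(process_rewards, outcome_reward, outcome_coef=1.0):
    """Combine dense process rewards with a sequence-level verifier outcome.

    The outcome reward (e.g. 1.0 if the answer or tests pass, else 0.0) is added
    to the final token, following the usual terminal-reward convention in
    PPO-style RLHF. `outcome_coef` is a hypothetical knob, not a documented
    PRIME setting.
    """
    rewards = process_rewards.clone()
    rewards[-1] = rewards[-1] + outcome_coef * outcome_reward
    return rewards
```

The resulting per-token reward vector is what a PPO-style update would consume in place of a single sparse outcome score; the PRM itself is refreshed online on the verifier labels, as described above.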

Quick Start & Requirements

Highlighted Details

  • Achieves an average improvement of 16.7%, and more than 20% on AMC and AIME, over the SFT baseline.
  • Outperforms larger models like Llama-3.1-70B-Instruct and GPT-4o on specific reasoning benchmarks.
  • Demonstrates efficiency, achieving results with 1/10th the data and model resources of comparable models.
  • Utilizes vLLM for high-throughput rollout inference and extends the veRL framework (see the sketch below).
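
As an illustration of the rollout stage, the snippet below uses vLLM's offline `LLM.generate` API to sample several candidate responses per prompt. The model name, sampling settings, and prompt are placeholders for this sketch; PRIME's actual rollout configuration lives in its veRL-based training code.

```python
from vllm import LLM, SamplingParams

# Placeholder model and sampling settings, not PRIME's actual configuration.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")
sampling = SamplingParams(n=4, temperature=1.0, max_tokens=1024)

prompts = ["Solve: what is the sum of the first 100 positive integers?"]
for output in llm.generate(prompts, sampling):
    for candidate in output.outputs:
        # Each candidate would then be scored by the outcome verifier and the implicit PRM.
        print(candidate.text)
```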

Maintenance & Community

  • Actively developed; the most recent news update is from March 2025.
  • Paper released on arXiv; code released January 2025.
  • The README notes that the code extends veRL and makes use of Eurus, Qwen2.5-Math, and LiveCodeBench.

Licensing & Compatibility

  • No explicit license is mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify a license, which is crucial for determining commercial usability. Performance claims are strong, but the specific datasets and evaluation methodologies for benchmarks such as AIME, MATH-500, and AMC are detailed only in the paper and require further review for full context.

Health Check
  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History
152 stars in the last 90 days

