PRIME by PRIME-RL

Scalable RL solution for advanced reasoning of language models

Created 8 months ago
1,731 stars

Top 24.7% on SourcePulse

Project Summary

PRIME is an open-source solution for enhancing large language model (LLM) reasoning capabilities through reinforcement learning (RL) with implicit process rewards. It targets researchers and developers aiming to improve LLM performance on complex tasks like math and coding, offering a scalable alternative to imitation learning by providing dense, online-updatable reward signals.

How It Works

PRIME leverages an "Implicit Process Reward Model" (Implicit PRM) trained as an outcome reward model (ORM). This approach avoids the need for explicit process labels: the model instead learns a Q-function that yields token-level rewards. The Implicit PRM is updated online against outcome verifiers, mitigating distribution shift and scalability issues. PRIME integrates this into an RL framework in which both the policy model and the PRM are initialized from a Supervised Fine-Tuned (SFT) model. In each RL iteration, rollouts are generated and scored by both the PRM and an outcome verifier; the PRM is then updated on these rollouts, and the combined outcome and process rewards update the policy model, typically via PPO.
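The reward computation described above can be sketched as follows. This is a minimal illustration, assuming the implicit-PRM log-ratio formulation from the paper (token reward proportional to the log-probability ratio between the PRM and a frozen reference model); all function names and the `BETA` value are invented for the example, not taken from the PRIME codebase.

```python
# Hypothetical sketch of PRIME-style implicit process rewards.
# Names and the BETA coefficient are illustrative assumptions.

BETA = 0.05  # reward scaling coefficient (assumed value)


def implicit_process_rewards(prm_logprobs, ref_logprobs, beta=BETA):
    """Token-level rewards r_t = beta * (log pi_prm(y_t) - log pi_ref(y_t)).

    The Implicit PRM is trained only on outcome labels (like an ORM), yet
    the per-token log-ratio against a frozen reference acts as a learned
    Q-function, yielding dense rewards without explicit process labels.
    """
    return [beta * (p - r) for p, r in zip(prm_logprobs, ref_logprobs)]


def combined_return(process_rewards, outcome_reward, outcome_weight=1.0):
    """Mix dense process rewards with the sparse verifier outcome (e.g. 0/1)."""
    return sum(process_rewards) + outcome_weight * outcome_reward


# Toy usage: two tokens' log-probs under the PRM and the reference model.
rewards = implicit_process_rewards([-0.1, -0.2], [-0.3, -0.5])
total = combined_return(rewards, outcome_reward=1.0)
```

In the actual training loop, `outcome_reward` would come from a verifier (e.g. an answer checker or test suite), and the PRM producing `prm_logprobs` would itself be updated online on the same rollouts before the policy update.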

Quick Start & Requirements

Highlighted Details

  • Achieves a 16.7% average improvement over SFT models, with gains above 20% on AMC/AIME.
  • Outperforms larger models like Llama-3.1-70B-Instruct and GPT-4o on specific reasoning benchmarks.
  • Demonstrates efficiency, achieving results with 1/10th the data and model resources of comparable models.
  • Utilizes vLLM for high-throughput inference and extends the veRL framework.

Maintenance & Community

  • Active development, with recent news from March 2025.
  • Paper released on arXiv; code released January 2025.
  • Extends the veRL framework and builds on Eurus, Qwen2.5-Math, and LiveCodeBench.

Licensing & Compatibility

  • No explicit license is mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify a license, which is crucial for determining commercial usability. The performance claims are strong, but the datasets and evaluation methodologies behind benchmarks such as AIME, MATH-500, and AMC are detailed only in the paper, so assessing them in full context requires reviewing it.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 60 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (cofounder of Prime Intellect), Shizhe Diao (author of LMFlow; research scientist at NVIDIA), and 4 more.

  • simpleRL-reason by hkust-nlp — RL recipe for reasoning ability in models (0.1%, 4k stars; created 7 months ago, updated 1 month ago).