RLP by NVlabs

Reinforcement learning pre-training for enhanced reasoning

Created 9 months ago

252 stars

Top 99.6% on SourcePulse

Project Summary

RLP (Reinforcement Learning Pre-training) addresses the absence of explicit "thinking" in standard LLM pre-training. It introduces an objective that treats Chain-of-Thought (CoT) as an action, rewarded by the information gain it provides on the next token. This dense, verifier-free reward builds reasoning foundations during pre-training and is aimed at researchers and engineers seeking more robust LLMs.

How It Works

RLP reframes pre-training by treating Chain-of-Thought (CoT) generation as an action taken before next-token prediction. The action is rewarded according to the information gain it contributes toward predicting the observed next token. Because the reward needs no external verifier and is available at every token position, it applies directly to standard text pre-training corpora, so reasoning is instilled during pre-training itself.
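The reward described above can be illustrated with a minimal sketch. The summary does not give the exact formula, so this assumes that "information gain" is measured as the difference in log-probability of the observed next token with and without the generated CoT. The function names and the probability values are hypothetical and are not taken from the RLP code, which has not been released.

```python
import math
from typing import List, Sequence


def information_gain_reward(p_with_cot: float, p_without_cot: float) -> float:
    """Reward for one token position, in nats.

    Assumption (not confirmed by the source): the reward is
    log p(next token | context, CoT) - log p(next token | context).
    Positive when the CoT makes the observed token more likely.
    """
    for p in (p_with_cot, p_without_cot):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_cot) - math.log(p_without_cot)


def dense_rewards(
    probs_with_cot: Sequence[float], probs_without_cot: Sequence[float]
) -> List[float]:
    """One reward per token position, so the signal is dense, not per-episode."""
    if len(probs_with_cot) != len(probs_without_cot):
        raise ValueError("both sequences must cover the same positions")
    return [
        information_gain_reward(a, b)
        for a, b in zip(probs_with_cot, probs_without_cot)
    ]


if __name__ == "__main__":
    # Made-up probabilities of the observed next token at three positions.
    with_cot = [0.50, 0.10, 0.30]
    without_cot = [0.25, 0.20, 0.30]
    for position, reward in enumerate(dense_rewards(with_cot, without_cot)):
        print(f"position {position}: reward = {reward:+.3f} nats")
```

Under this assumption, a CoT that raises the probability of the observed token earns a positive reward, one that lowers it earns a negative reward, and one that changes nothing earns zero. No external verifier or labeled answer is involved, only the text that actually follows in the corpus.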

Quick Start & Requirements

The official code repository is slated for release soon. Specific installation instructions, dependencies (e.g., Python, CUDA), and hardware prerequisites are not yet detailed. Links to official quick-start guides, documentation, or demos are also unavailable.

Highlighted Details

Qwen3 1.7B Base: RLP boosts math/science performance (+19% avg. over base, +17% over CPT). Gains compound post-training.
Nemotron Nano 12B v2 Base: Applied for only 250M tokens, RLP significantly outperforms the base model (20T tokens) by +35% on average, with the largest gains in science (+23 pts).
Architecture Agnostic: Generalizes across models, including hybrid Mamba-Transformer designs.
Efficiency: Delivers performance gains without extra compute or extensive token exposure.

Maintenance & Community

Associated with NVIDIA Corporation, with contributions from Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. No community channels or roadmap links are provided.

Licensing & Compatibility

Copyrighted by NVIDIA Corporation (© 2025), all rights reserved. This proprietary licensing likely restricts commercial use or integration into closed-source projects without explicit permission. A standard open-source license is not specified.

Limitations & Caveats

The official implementation has been announced but not yet released, so the project cannot yet be used directly. The README gives no details on hardware requirements, setup procedures, or limitations beyond the pending code release.


Explore Similar Projects

Awesome-Long2short-on-LRMs by Hongcheng-Gao

XBai-o4 by MetaStone-AI

l1 by cmu-l3

Awesome-Efficient-Reasoning by hemingkx

M_GRPO by baibizhe

MiMo by XiaomiMiMo

X-R1 by dhcode-cpp

PRIME by PRIME-RL

train-deepseek-r1 by FareedKhan-dev

simpleRL-reason by hkust-nlp

HRM by sapientinc

DeepSeek-R1 by deepseek-ai