Entropy-Mechanism-of-RL by PRIME-RL

LLM reasoning enhanced with RL entropy control

Created 3 months ago
327 stars

Top 83.4% on SourcePulse

Project Summary

This repository addresses the "entropy collapse" problem in reinforcement learning (RL) for large language models (LLMs): policy entropy drops sharply during training, leaving the model overconfident and its performance saturated. It targets researchers and practitioners applying RL to LLM reasoning, offering methods that preserve exploration and improve final performance.

How It Works

The project identifies an empirical exponential trade-off between policy entropy and downstream performance (the paper fits a curve of the form R = −a·e^H + b), suggesting that entropy exhaustion bottlenecks LLM reasoning. It theoretically links the decline in entropy to the per-token covariance between an action's probability and its logit update; this covariance is typically positive and drives entropy downward. To counter this, the proposed Clip-Cov and KL-Cov methods restrict updates for high-covariance tokens, preventing entropy collapse and improving performance.
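The high-covariance filter described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the function name, the clip fraction, and the use of the advantage as a stand-in for the logit update are all assumptions made for clarity.

```python
import numpy as np

def clip_cov_mask(logprobs, advantages, clip_frac=0.002):
    """Sketch of a Clip-Cov-style token filter.

    Computes a per-token covariance term between the action log-probability
    and the advantage (a proxy for the logit update under policy gradient),
    then masks out the top `clip_frac` fraction of tokens by covariance so
    they are excluded from the update. Hypothetical interface; the default
    fraction is illustrative.
    """
    lp = np.asarray(logprobs, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    # Per-token contribution to Cov(log pi, A) across the batch.
    cov = (lp - lp.mean()) * (adv - adv.mean())
    k = max(1, int(len(cov) * clip_frac))
    cutoff = np.sort(cov)[-k]          # covariance of the k-th largest token
    mask = cov < cutoff                # drop the top-k high-covariance tokens
    return mask, cov

# Usage: multiply each token's policy-gradient contribution by `mask`,
# so confident, high-advantage tokens stop driving entropy down.
```

A KL-Cov variant would instead keep all tokens but add a KL penalty toward the old policy on the selected high-covariance tokens.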

Quick Start & Requirements

  • Installation: conda env create -n entropy -f environment.yaml
  • Prerequisites: Conda, Python, PyTorch, Qwen2.5 models; AIME24, AIME25, and AMC evaluation datasets.
  • Training example: bash recipe/dapo/7b_kl_cov.sh for Qwen2.5-7B on a single node.
  • Multi-node training example: bash recipe/dapo/32b_kl_cov.sh for Qwen2.5-32B.
  • Documentation: https://github.com/PRIME-RL/Entropy-Mechanism-of-RL

Highlighted Details

  • Maintains policy entropy more than 10x higher than baselines whose entropy has plateaued.
  • Demonstrates non-trivial performance improvements across benchmarks, with up to 6.4% gains on 32B models and 15.0% on challenging math reasoning tasks (AIME24/25).
  • Code is forked from verl and built on the dapo recipe.
  • Utilizes vLLM for inference and Qwen2.5 models for training.

Maintenance & Community

  • Development has been active, with recent updates merging KL_Cov and Clip_Cov into verl.
  • Links to relevant discussions and announcements are provided via Twitter.
  • Contact information for key researchers is available.

Licensing & Compatibility

  • The repository is licensed under the MIT License.
  • The MIT License permits use in commercial and closed-source applications.

Limitations & Caveats

The project is research code and may be at an experimental stage. Dataset identifiers ("data_source") are hardcoded in some training scripts, so using other datasets requires editing them. Multi-node training may require additional environment variable configuration.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).

Eureka by eureka-research

  • 0.2% · 3k stars
  • LLM-based reward design for reinforcement learning
  • Created 2 years ago · Updated 1 year ago