Entropy-Mechanism-of-RL by PRIME-RL

LLM reasoning enhanced with RL entropy control

Created 3 months ago
327 stars

Top 83.4% on SourcePulse

Project Summary

This repository addresses the "entropy collapse" problem in reinforcement learning (RL) for large language models (LLMs): policy entropy drops sharply during training, leaving the model overconfident and its performance saturated. It targets researchers and practitioners applying RL to LLM reasoning, offering methods that preserve exploration and improve final performance.

How It Works

The project identifies an empirical exponential trade-off between policy entropy and downstream performance (the paper fits a curve of the form R = −a·e^H + b), suggesting that entropy exhaustion bottlenecks LLM reasoning. It theoretically links the decline in entropy to the per-token covariance between an action's probability and its logit update; this covariance is typically positive and drives entropy downward. To counter this, the proposed Clip-Cov and KL-Cov methods restrict updates for high-covariance tokens, preventing entropy collapse and improving performance.
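The high-covariance filter described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the function name, the clip fraction, and the use of the advantage as a stand-in for the logit update are all assumptions made for clarity.

```python
import numpy as np

def clip_cov_mask(logprobs, advantages, clip_frac=0.002):
    """Sketch of a Clip-Cov-style token filter.

    Computes a per-token covariance term between the action log-probability
    and the advantage (a proxy for the logit update under policy gradient),
    then masks out the top `clip_frac` fraction of tokens by covariance so
    they are excluded from the update. Hypothetical interface; the default
    fraction is illustrative.
    """
    lp = np.asarray(logprobs, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    # Per-token contribution to Cov(log pi, A) across the batch.
    cov = (lp - lp.mean()) * (adv - adv.mean())
    k = max(1, int(len(cov) * clip_frac))
    cutoff = np.sort(cov)[-k]          # covariance of the k-th largest token
    mask = cov < cutoff                # drop the top-k high-covariance tokens
    return mask, cov

# Usage: multiply each token's policy-gradient contribution by `mask`,
# so confident, high-advantage tokens stop driving entropy down.
```

A KL-Cov variant would instead keep all tokens but add a KL penalty toward the old policy on the selected high-covariance tokens.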

Quick Start & Requirements

  • Installation: conda env create -n entropy -f environment.yaml
  • Prerequisites: Conda, Python, PyTorch, Qwen2.5 models; AIME24, AIME25, and AMC evaluation datasets.
  • Training example: bash recipe/dapo/7b_kl_cov.sh for Qwen2.5-7B on a single node.
  • Multi-node training example: bash recipe/dapo/32b_kl_cov.sh for Qwen2.5-32B.
  • Documentation: https://github.com/PRIME-RL/Entropy-Mechanism-of-RL

Highlighted Details

  • Maintains policy entropy more than 10x higher than baselines whose entropy has plateaued.
  • Demonstrates non-trivial performance improvements across benchmarks, with up to 6.4% gains on 32B models and 15.0% on challenging math reasoning tasks (AIME24/25).
  • Code is forked from verl and built on the dapo recipe.
  • Utilizes vLLM for inference and Qwen2.5 models for training.

Maintenance & Community

  • Development has been active, with recent updates merging KL_Cov and Clip_Cov into verl.
  • Links to relevant discussions and announcements are provided via Twitter.
  • Contact information for key researchers is available.

Licensing & Compatibility

  • The repository is licensed under the MIT License.
  • The MIT License permits use in commercial and closed-source applications.

Limitations & Caveats

The project is research code and may be at an experimental stage. Dataset identifiers ("data_source") are hardcoded in some training scripts, so using other datasets requires editing them. Multi-node training may require additional environment variable configuration.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).

Eureka by eureka-research

  • 0.2% · 3k stars
  • LLM-based reward design for reinforcement learning
  • Created 2 years ago · Updated 1 year ago