LUFFY by ElliottYan

Framework for off-policy learning in large reasoning models

Created 4 months ago
286 stars

Top 91.5% on SourcePulse

Project Summary

LUFFY is a reinforcement learning framework designed to enhance large reasoning models by integrating off-policy guidance. It targets researchers and developers working on improving the reasoning capabilities of LLMs, offering a method to leverage external reasoning traces for more effective training.

How It Works

LUFFY builds upon the GRPO framework, combining on-policy rollouts with off-policy demonstrations. It introduces a novel approach to advantage estimation by incorporating these external traces and employs policy shaping via regularized importance sampling. This allows LUFFY to dynamically balance imitation and exploration, emphasizing crucial but low-probability actions for better generalization.
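The advantage estimation and policy shaping described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the shaping function f(p) = p / (p + gamma) and the default gamma are assumptions inferred from the description of regularized importance sampling.

```python
import math

def group_advantages(rewards):
    # GRPO-style advantage estimation: rewards for the whole group
    # (on-policy rollouts plus off-policy demonstrations) are
    # normalized together by the group mean and standard deviation.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def shaped_weight(token_prob, gamma=0.1):
    # Policy shaping via regularized importance sampling: weight each
    # off-policy token by f(p) = p / (p + gamma). The relative boost
    # f(p) / p = 1 / (p + gamma) is largest for small p, so crucial
    # but low-probability actions receive extra emphasis.
    # gamma=0.1 is a hypothetical default, not a value from the repo.
    return token_prob / (token_prob + gamma)

# A mixed group: three on-policy rollouts plus one off-policy
# demonstration, each scored by a binary reward.
advs = group_advantages([0.0, 1.0, 0.0, 1.0])
```

Because the off-policy demonstration is scored inside the same group, a correct external trace raises the group baseline, while the shaped weights keep gradients flowing to rare tokens that plain importance ratios would suppress.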

Quick Start & Requirements

  • Installation: requires Python 3.10 and a Conda environment; install dependencies with pip install -r requirements.txt, then run pip install -e . for both the main package and verl.
  • Prerequisites: flash-attention (a specific version is recommended, with a direct download link provided in the README) and vLLM for inference; CUDA is required for flash-attention.
  • Resources: Training scripts are provided, and inference uses vLLM. Specific hardware requirements are not detailed but are typical for large model training.
  • Links: Paper, Hugging Face Collection, Inference Example.
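A minimal setup sketch following the bullets above. The environment name, package paths, and wheel choices are assumptions; consult the repository README for the exact commands and the recommended flash-attention wheel.

```shell
# Create and activate a Conda environment (Python 3.10, per the
# requirements above).
conda create -n luffy python=3.10 -y
conda activate luffy

# Install dependencies, then the main package and verl in editable
# mode (the verl path is an assumption; see the repo README).
pip install -r requirements.txt
pip install -e .
pip install -e ./verl

# Flash-attention (the README recommends a specific prebuilt wheel)
# and vLLM for inference; both require CUDA.
pip install flash-attn vllm
```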

Highlighted Details

  • Achieves state-of-the-art results on multiple math reasoning benchmarks, outperforming SFT and other zero-RL methods.
  • Demonstrates strong generalization to out-of-distribution tasks.
  • Offers a framework for integrating off-policy traces from models like DeepSeek-R1.
  • Provides implementations for SFT, RL w/ SFT Loss, and SFT+RL baselines for comparison.
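The "RL w/ SFT Loss" baseline in the list above can be sketched as a weighted sum of the RL objective and a supervised term on the off-policy demonstration. This is an illustrative combination, not the repository's code; sft_coef and its default are hypothetical.

```python
import math

def sft_nll(token_probs):
    # Mean supervised negative log-likelihood of a demonstration trace
    # under the current policy (token_probs are the policy's
    # probabilities for each demonstration token).
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def rl_with_sft_objective(rl_loss, demo_token_probs, sft_coef=0.5):
    # "RL w/ SFT Loss" baseline: add a weighted SFT term on the
    # off-policy demonstration to the RL loss. sft_coef=0.5 is a
    # hypothetical default, not a value from the repository.
    return rl_loss + sft_coef * sft_nll(demo_token_probs)
```

Plain SFT corresponds to dropping the RL term entirely, and SFT+RL to running the two stages in sequence rather than mixing the losses.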

Maintenance & Community

The project is actively maintained, with recent updates integrating more baseline implementations and re-evaluating models. Contact information for the authors is provided.

Licensing & Compatibility

The repository does not explicitly state a license. The project utilizes components from other open-source projects, and users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project is described as having an "alpha" status. While comprehensive benchmarks are provided, specific hardware requirements for training and detailed performance metrics beyond benchmark scores are not extensively documented. The license is not specified, which may pose a barrier for some users.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

simpleRL-reason by hkust-nlp

  • RL recipe for reasoning ability in models
  • 4k stars; top 0.1% on SourcePulse
  • Created 7 months ago; updated 3 weeks ago
  • Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Research Scientist at NVIDIA; author of LMFlow), and 4 more