LUFFY by ElliottYan

Framework for off-policy learning in large reasoning models

Created 4 months ago
286 stars

Top 91.5% on SourcePulse

Project Summary

LUFFY is a reinforcement learning framework designed to enhance large reasoning models by integrating off-policy guidance. It targets researchers and developers working on improving the reasoning capabilities of LLMs, offering a method to leverage external reasoning traces for more effective training.

How It Works

LUFFY builds upon the GRPO framework, combining on-policy rollouts with off-policy demonstrations. It introduces a novel approach to advantage estimation by incorporating these external traces and employs policy shaping via regularized importance sampling. This allows LUFFY to dynamically balance imitation and exploration, emphasizing crucial but low-probability actions for better generalization.
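The advantage estimation and policy shaping described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the shaping function f(p) = p / (p + gamma) and the default gamma are assumptions inferred from the description of regularized importance sampling.

```python
import math

def group_advantages(rewards):
    # GRPO-style advantage estimation: rewards for the whole group
    # (on-policy rollouts plus off-policy demonstrations) are
    # normalized together by the group mean and standard deviation.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def shaped_weight(token_prob, gamma=0.1):
    # Policy shaping via regularized importance sampling: weight each
    # off-policy token by f(p) = p / (p + gamma). The relative boost
    # f(p) / p = 1 / (p + gamma) is largest for small p, so crucial
    # but low-probability actions receive extra emphasis.
    # gamma=0.1 is a hypothetical default, not a value from the repo.
    return token_prob / (token_prob + gamma)

# A mixed group: three on-policy rollouts plus one off-policy
# demonstration, each scored by a binary reward.
advs = group_advantages([0.0, 1.0, 0.0, 1.0])
```

Because the off-policy demonstration is scored inside the same group, a correct external trace raises the group baseline, while the shaped weights keep gradients flowing to rare tokens that plain importance ratios would suppress.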

Quick Start & Requirements

  • Installation: requires Python 3.10 and a Conda environment; install dependencies with pip install -r requirements.txt, then run pip install -e . for both the main package and verl.
  • Prerequisites: flash-attention (a specific version is recommended, with a direct download link provided in the README) and vLLM for inference; CUDA is required for flash-attention.
  • Resources: Training scripts are provided, and inference uses vLLM. Specific hardware requirements are not detailed but are typical for large model training.
  • Links: Paper, Hugging Face Collection, Inference Example.
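A minimal setup sketch following the bullets above. The environment name, package paths, and wheel choices are assumptions; consult the repository README for the exact commands and the recommended flash-attention wheel.

```shell
# Create and activate a Conda environment (Python 3.10, per the
# requirements above).
conda create -n luffy python=3.10 -y
conda activate luffy

# Install dependencies, then the main package and verl in editable
# mode (the verl path is an assumption; see the repo README).
pip install -r requirements.txt
pip install -e .
pip install -e ./verl

# Flash-attention (the README recommends a specific prebuilt wheel)
# and vLLM for inference; both require CUDA.
pip install flash-attn vllm
```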

Highlighted Details

  • Achieves state-of-the-art results on multiple math reasoning benchmarks, outperforming SFT and other zero-RL methods.
  • Demonstrates strong generalization to out-of-distribution tasks.
  • Offers a framework for integrating off-policy traces from models like DeepSeek-R1.
  • Provides implementations for SFT, RL w/ SFT Loss, and SFT+RL baselines for comparison.
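The "RL w/ SFT Loss" baseline in the list above can be sketched as a weighted sum of the RL objective and a supervised term on the off-policy demonstration. This is an illustrative combination, not the repository's code; sft_coef and its default are hypothetical.

```python
import math

def sft_nll(token_probs):
    # Mean supervised negative log-likelihood of a demonstration trace
    # under the current policy (token_probs are the policy's
    # probabilities for each demonstration token).
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def rl_with_sft_objective(rl_loss, demo_token_probs, sft_coef=0.5):
    # "RL w/ SFT Loss" baseline: add a weighted SFT term on the
    # off-policy demonstration to the RL loss. sft_coef=0.5 is a
    # hypothetical default, not a value from the repository.
    return rl_loss + sft_coef * sft_nll(demo_token_probs)
```

Plain SFT corresponds to dropping the RL term entirely, and SFT+RL to running the two stages in sequence rather than mixing the losses.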

Maintenance & Community

The project is actively maintained, with recent updates integrating more baseline implementations and re-evaluating models. Contact information for the authors is provided.

Licensing & Compatibility

The repository does not explicitly state a license. The project utilizes components from other open-source projects, and users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project is described as having an "alpha" status. While comprehensive benchmarks are provided, specific hardware requirements for training and detailed performance metrics beyond benchmark scores are not extensively documented. The license is not specified, which may pose a barrier for some users.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

simpleRL-reason by hkust-nlp

  • RL recipe for reasoning ability in models
  • 4k stars; top 0.1% on SourcePulse
  • Created 7 months ago; updated 3 weeks ago
  • Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Research Scientist at NVIDIA; author of LMFlow), and 4 more