dLLM-RL by Gen-Verse

Reinforcement learning framework for diffusion language models

Created 2 months ago
295 stars

Top 89.7% on SourcePulse

View on GitHub
Project Summary

Summary

Gen-Verse/dLLM-RL introduces TraceRL, a reinforcement learning framework for diffusion language models (DLMs), together with TraDo-8B and a suite of companion models trained with it. Aimed at researchers and practitioners, the project reports state-of-the-art performance on complex reasoning tasks such as mathematics and coding, making a case for RL-trained DLMs as an alternative to autoregressive models for generative tasks.

How It Works

The framework's core is TraceRL, a trajectory-aware reinforcement learning method that pairs the policy with a diffusion-based value model. The value model reduces gradient variance and improves optimization stability, a key challenge in DLM training. Trained with TraceRL, the TraDo model series (e.g., TraDo-4B-Instruct, TraDo-8B-Instruct, TraDo-8B-Thinking) achieves state-of-the-art results on math and coding reasoning benchmarks, making diffusion-based models directly competitive with traditional autoregressive models.
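To make the mechanism concrete, here is a minimal sketch of a trajectory-aware policy-gradient step with a value-model baseline. It is an illustrative PPO-style surrogate under our own assumptions (per-step log-probabilities collected along the diffusion decoding trajectory); the function name, signature, and exact objective are hypothetical, not the repo's API, and the actual TraceRL loss may differ.

```python
import torch
import torch.nn.functional as F

def tracerl_step_loss(logp_new, logp_old, rewards, values, clip_eps=0.2):
    """PPO-style surrogate over diffusion decoding steps (illustrative only).

    logp_new / logp_old: (T,) summed log-probs of the tokens unmasked at each
    of the T decoding steps, under the current and behavior policies.
    rewards: (T,) per-step returns (e.g., a terminal verifier reward broadcast
    back along the trajectory).
    values:  (T,) predictions from a diffusion-based value model.
    """
    advantages = rewards - values.detach()        # baseline cuts gradient variance
    ratio = torch.exp(logp_new - logp_old)        # per-step importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = F.mse_loss(values, rewards)      # also fit the value model
    return policy_loss + 0.5 * value_loss
```

The detached value baseline is what cuts variance: the policy gradient sees rewards minus predicted values rather than raw rewards, while the MSE term trains the value model itself.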

Quick Start & Requirements

  • Installation: Create a Conda environment (conda create --name dllm-rl python=3.10, then source activate dllm-rl), install PyTorch (torch==2.6.0) and the specific FlashAttention wheel (flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl), then run pip install -r requirements.txt; a consolidated script follows this list.
  • Prerequisites: Python 3.10, PyTorch 2.6.0, and CUDA 12 are required. Multi-node setups may have additional hardware dependencies.
  • Data: Datasets can be downloaded using python download_data.py --dataset <dataset_name> (e.g., MATH500).
  • Configuration: Users must select or create configuration files in the ./configs directory.
  • Links: The documentation provides no direct links to external docs, demos, or quick-start guides.
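
Consolidated, the setup and data steps above look like this (commands as given in the docs; the FlashAttention wheel must first be downloaded for your Python/CUDA/PyTorch combination):

```bash
# Create and activate the environment (Python 3.10 required).
conda create --name dllm-rl python=3.10
source activate dllm-rl

# Install PyTorch 2.6.0 and the matching FlashAttention wheel (local file).
pip install torch==2.6.0
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Remaining dependencies.
pip install -r requirements.txt

# Fetch a benchmark dataset, e.g. MATH500.
python download_data.py --dataset MATH500
```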

Highlighted Details

  • Broad model support: Compatible with diverse DLM architectures including full attention, adapted, and block attention models (e.g., TraDo, SDAR, Dream, LLaDA, MMaDA, Diffu-Coder).
  • Inference acceleration: Features include optimized KV-cache, jetengine (based on nano-vllm), various sampling strategies, and robust multi-node inference capabilities.
  • Advanced RL training: Implements TraceRL (with optional diffusion value model), coupled RL, and random masking RL, all benefiting from KV-cache acceleration.
  • Flexible SFT: Supports Block SFT, semi-AR SFT, and random masking SFT, with multi-node and long-CoT finetuning options.
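
As an illustration of the random-masking objective that both the SFT and RL variants build on, here is a minimal LLaDA-style masked-diffusion loss. The `model` interface, tensor shapes, and 1/t reweighting are assumptions made for the sketch, not this repository's implementation:

```python
import torch
import torch.nn.functional as F

def random_masking_sft_loss(model, input_ids, mask_token_id):
    """Sketch of one random-masking SFT step for a masked diffusion LM.

    Assumes `model(ids)` returns logits of shape (B, L, V); hypothetical
    interface, shown only to illustrate the objective.
    """
    B, L = input_ids.shape
    # Sample a masking ratio t ~ U(0, 1] per sequence, mask tokens i.i.d.
    t = torch.rand(B, 1, device=input_ids.device).clamp_min(1e-3)
    mask = torch.rand(B, L, device=input_ids.device) < t
    noisy = torch.where(mask, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(noisy)
    # Cross-entropy on masked positions only, reweighted by 1/t.
    loss_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(B, L)
    return (loss_tok * mask / t).sum() / (B * L)
```

Masking each sequence at a randomly drawn ratio t and reweighting the masked-token cross-entropy by 1/t gives the standard masked-diffusion training bound, which is why random-masking SFT can reuse an ordinary cross-entropy pipeline.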

Maintenance & Community

Information regarding community channels (e.g., Discord, Slack), project roadmaps, or notable contributors is not present in the provided documentation.

Licensing & Compatibility

The software license and details on compatibility for commercial use or integration with closed-source projects are not specified.

Limitations & Caveats

The provided documentation does not explicitly detail any limitations, known bugs, alpha status, or unsupported platforms.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
27
Star History
63 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Huber (Cofounder of Chroma), Omar Khattab (Coauthor of DSPy, ColBERT; Professor at MIT), and 1 more.

arbor by Ziems

26.8%
264
Framework for optimizing DSPy programs with RL
Created 8 months ago
Updated 1 day ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

0.5%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 3 weeks ago