SMDM by ML-GSAI

PyTorch code for masked diffusion model research paper

created 9 months ago
265 stars


Project Summary

This repository provides the official PyTorch implementation for "Scaling up Masked Diffusion Models on Text," a research paper exploring the scalability and effectiveness of Masked Diffusion Models (MDMs) in language tasks. It targets researchers and practitioners interested in advancing text generation and understanding beyond traditional autoregressive models, offering competitive performance and unique advantages in bidirectional reasoning and temporal adaptation.

How It Works

The project implements Masked Diffusion Models (MDMs) for text, a probabilistic approach whose scaling behavior is comparable to that of autoregressive models (ARMs), with a relatively small compute gap between the two. It introduces unsupervised classifier-free guidance, which leverages unpaired data for conditional inference. The architecture is designed to handle bidirectional reasoning and temporal shifts in data, addressing limitations found in ARMs.
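As a rough illustration of the ideas above, the following toy sketch shows one common formulation of masked diffusion training (mask each token independently with probability t, then score only the masked positions with a 1/t weight) together with the standard classifier-free guidance combination of conditional and unconditional scores. It is a sketch under stated assumptions, not the repository's code: the names MASK, forward_mask, masked_loss, uniform_predictor, and cfg_scores are hypothetical, the "model" is a uniform placeholder, and the actual SMDM objective and guidance details may differ.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token, used only for this illustration


def forward_mask(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_loss(tokens, noisy, t, predict_proba):
    """Cross-entropy on masked positions only, weighted by 1/t and
    normalized by sequence length (an assumed, commonly used form)."""
    total = 0.0
    for i, (tok, obs) in enumerate(zip(tokens, noisy)):
        if obs == MASK:
            total += -math.log(predict_proba(noisy, i)[tok])
    return total / (t * len(tokens))


def uniform_predictor(vocab):
    """Placeholder for a real network: assigns equal probability to every token."""
    probs = {tok: 1.0 / len(vocab) for tok in vocab}
    return lambda noisy, i: probs


def cfg_scores(cond, uncond, w):
    """Standard classifier-free guidance: (1 + w) * cond - w * uncond, per entry.
    With w = 0 this reduces to the conditional scores."""
    return [(1 + w) * c - w * u for c, u in zip(cond, uncond)]


rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))
t = 0.5
noisy = forward_mask(tokens, t, rng)
loss = masked_loss(tokens, noisy, t, uniform_predictor(vocab))
print(noisy, round(loss, 4))
print(cfg_scores([1.0, 2.0], [0.5, 0.5], w=1.0))
```

Because every position can be masked or visible regardless of its location, the predictor always conditions on context from both sides of a masked token, which is the property behind the bidirectional reasoning mentioned above.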

Quick Start & Requirements

  • Installation: Requires an Anaconda environment, potentially based on the TinyLlama setup. Dependencies are installed with the command: pip install lm-eval==0.4.4 numpy==1.25.0 bitsandbytes==0.43.1 openai==0.28 fschat==0.2.34 anthropic (Conda installation commands are available in CONDA.md).
  • Prerequisites: PyTorch, CUDA, Python. Specific dataset preprocessing (SlimPajama, ShareGPT, GSM8K, FineWeb) is required.
  • Resources: Training commands indicate multi-GPU (8+) and multi-node setups are supported for large models (up to 1.1B parameters).
  • Links: Pretrained models are available on Hugging Face.
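Since several dependencies in the install command are pinned to exact versions, a quick check of the active environment can catch mismatches before a long training or evaluation run. The sketch below is an assumption-laden helper, not part of the repository: PINNED simply copies the versions listed above, and the names installed_version, same_release, and check_pins are hypothetical. The version comparison is only a rough approximation of how pip treats trailing zeros (so that 0.28 and 0.28.0 count as equal).

```python
from importlib import metadata

# Versions copied from the pip command listed above ("anthropic" is unpinned there).
PINNED = {
    "lm-eval": "0.4.4",
    "numpy": "1.25.0",
    "bitsandbytes": "0.43.1",
    "openai": "0.28",
    "fschat": "0.2.34",
}


def installed_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def same_release(a, b):
    """Compare versions while ignoring trailing '.0' segments (rough approximation)."""
    def strip(v):
        parts = v.split(".")
        while len(parts) > 1 and parts[-1] == "0":
            parts.pop()
        return parts
    return strip(a) == strip(b)


def check_pins(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every missing or mismatched package."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found is None or not same_release(found, wanted):
            problems[name] = (wanted, found)
    return problems


report = check_pins(PINNED)
for name, (wanted, found) in report.items():
    print(f"{name}: want {wanted}, found {found or 'not installed'}")
```

An empty report means every pinned package is present at the expected version; the lookup argument exists only so the check can be exercised without touching the real environment.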

Highlighted Details

  • A 1.1B MDM outperforms TinyLlama on zero-shot benchmarks and matches Llama-2 7B on GSM8K.
  • MDMs offer a flexible trade-off against ARMs: either a 1.4x speedup at comparable performance, or higher quality at increased computational cost.
  • MDMs successfully address the "reverse curse" problem, outperforming much larger ARMs.
  • The project provides implementations for training ARMs and MDMs, fine-tuning for specific tasks (math reasoning, conditional generation), and evaluation across various benchmarks.

Maintenance & Community

The project is associated with the ICLR 2025 paper "Scaling up Masked Diffusion Models on Text." Links to specific model checkpoints and evaluation scripts are provided.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup requires significant data preprocessing and potentially complex environment management (e.g., a separate Anaconda environment for FineWeb dataset preprocessing). Several dependencies are pinned to exact versions in the installation command, so conflicts with an existing environment are possible.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 105 stars in the last 90 days
