SMDM by ML-GSAI

PyTorch code for masked diffusion model research paper

Created 1 year ago

355 stars

Top 78.8% on SourcePulse

Project Summary

This repository provides the official PyTorch implementation for "Scaling up Masked Diffusion Models on Text," a research paper exploring the scalability and effectiveness of Masked Diffusion Models (MDMs) in language tasks. It targets researchers and practitioners interested in advancing text generation and understanding beyond traditional autoregressive models, offering competitive performance and unique advantages in bidirectional reasoning and temporal adaptation.

How It Works

The project implements Masked Diffusion Models (MDMs) for text, a probabilistic approach whose scaling rate is comparable to that of autoregressive models (ARMs), with a relatively small compute gap between the two. It introduces unsupervised classifier-free guidance, which leverages unpaired data to improve conditional inference. The models handle bidirectional reasoning and adapt to temporal shifts in data, two areas where ARMs are limited.
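The two ideas in this section can be illustrated with a small, framework-free sketch. It is not the repository's code: the `[MASK]` symbol, the function names, and the `log_prob_fn` stand-in for the network are all assumptions made for illustration. It assumes a linear masking schedule (each token is masked independently with probability t, and the loss on masked positions is weighted by 1/t) and the standard log-space form of classifier-free guidance.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, for illustration only


def mask_tokens(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_loss(tokens, noisy, log_prob_fn, t):
    """One Monte Carlo loss term: cross-entropy on masked positions, times 1/t.

    log_prob_fn(noisy, i, token) is a hypothetical stand-in for the network;
    it returns the log-probability of `token` at position i given `noisy`.
    """
    total = 0.0
    for i, (tok, cur) in enumerate(zip(tokens, noisy)):
        if cur == MASK:
            total -= log_prob_fn(noisy, i, tok)
    return total / t


def cfg_logits(cond, uncond, w):
    """Classifier-free guidance in log space: (1 + w) * cond - w * uncond."""
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


# Toy usage with a uniform "model" over a 4-word vocabulary.
rng = random.Random(0)
tokens = ["the", "cat", "sat"]
noisy = mask_tokens(tokens, 0.5, rng)
uniform = lambda seq, i, tok: math.log(1.0 / 4)
loss = masked_loss(tokens, noisy, uniform, 0.5)
```

With w = 0, `cfg_logits` returns the conditional scores unchanged; larger w pushes the prediction further from the unconditional one.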

Quick Start & Requirements

Installation: Requires an Anaconda environment, potentially based on TinyLlama. Install with pip install lm-eval==0.4.4 numpy==1.25.0 bitsandbytes==0.43.1 openai==0.28 fschat==0.2.34 anthropic. Conda installation commands are available in CONDA.md.
Prerequisites: PyTorch, CUDA, Python. Specific dataset preprocessing (SlimPajama, ShareGPT, GSM8K, FineWeb) is required.
Resources: Training commands indicate multi-GPU (8+) and multi-node setups are supported for large models (up to 1.1B parameters).
Links: Pretrained models are available on Hugging Face.
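Because the install command pins exact versions, it can help to confirm that an existing environment matches them before running anything. The following is a minimal sketch, not part of the repository: the helper names `parse_pins` and `check_pins` are hypothetical, and the pin list is copied from the install command above (`anthropic` is left out because it is unpinned).

```python
from importlib import metadata

# Pins copied from the install command above.
PINS = "lm-eval==0.4.4 numpy==1.25.0 bitsandbytes==0.43.1 openai==0.28 fschat==0.2.34"


def parse_pins(spec):
    """Turn 'name==version ...' into a {name: version} dict."""
    return dict(item.split("==", 1) for item in spec.split())


def check_pins(pins):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(parse_pins(PINS)).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result from `check_pins` means every pinned package is installed at the expected version.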

Highlighted Details

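A 1.1B MDM outperforms the 1.1B TinyLlama model, trained on the same data, on four of eight zero-shot benchmarks and is competitive with Llama-2 7B on GSM8K.
MDMs offer a flexible trade-off against ARMs: they match ARM performance while sampling 1.4x faster, or achieve higher quality at a higher computational cost.
MDMs overcome the "reverse curse" (also called the reversal curse), a task on which they outperform much larger ARMs.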
The project provides implementations for training ARMs and MDMs, fine-tuning for specific tasks (math reasoning, conditional generation), and evaluation across various benchmarks.
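One way to read the speed and quality trade-off above is that a masked diffusion sampler chooses how many network calls to spend on a sequence, whereas an ARM spends one call per generated token. The toy sampler below is a sketch under that assumption. It is not the repository's sampler and does not reproduce the 1.4x figure; `predict` is a hypothetical stand-in for the network.

```python
import math
import random

MASK = None  # hypothetical placeholder for a masked position


def sample(length, steps, predict, rng):
    """Start fully masked and reveal tokens over at most `steps` model calls.

    predict(seq) is a hypothetical stand-in for the network: it returns a
    proposed token for every position of seq.
    """
    seq = [MASK] * length
    calls = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        proposal = predict(seq)
        calls += 1
        # Reveal an even share of the remaining masked positions.
        k = math.ceil(len(masked) / (steps - step))
        for i in rng.sample(masked, k):
            seq[i] = proposal[i]
    return seq, calls


# Toy usage: 8 tokens generated with 4 model calls instead of 8.
stub = lambda seq: ["tok"] * len(seq)
tokens, calls = sample(8, 4, stub, random.Random(0))
```

Fewer steps mean fewer model calls but more tokens committed per call, which is where quality can be traded for speed.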

Maintenance & Community

The project is associated with the ICLR 2025 paper "Scaling up Masked Diffusion Models on Text." Links to specific model checkpoints and evaluation scripts are provided.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup requires significant data preprocessing and potentially complex environment management (e.g., a separate Anaconda environment for FineWeb dataset preprocessing). Several dependencies are pinned to specific versions (for example, numpy==1.25.0 and openai==0.28).

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

11 stars in the last 30 days