LLaDA by ML-GSAI

LLM research paper exploring masked diffusion language models

created 5 months ago
2,648 stars

Top 18.2% on sourcepulse

View on GitHub
Project Summary

LLaDA is an 8B-parameter, PyTorch-based masked diffusion language model designed to rival LLaMA3 8B. It targets researchers and developers exploring alternatives to autoregressive LLM architectures, offering a theoretically grounded approach to generative language modeling with capabilities such as in-context learning and instruction following.

How It Works

LLaDA employs a masked diffusion approach that departs from autoregressive models like GPT. It keeps a Transformer backbone but defines the model distribution through a forward masking process and a learned reverse (unmasking) process, training with a masking ratio sampled uniformly at random. Its training objective is a provable upper bound on the negative log-likelihood, which makes it a principled generative model, and the approach is Fisher consistent, which underpins its scalability in principle.
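
A minimal sketch of that objective, assuming `model` returns per-token logits and `mask_id` is the mask token's id (names are illustrative, not the repo's API):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, input_ids, mask_id):
    """One training step of the masked-diffusion objective (sketch)."""
    b, l = input_ids.shape
    # Sample a masking ratio t ~ Uniform(0, 1) per sequence.
    t = torch.rand(b, 1, device=input_ids.device)
    # Mask each token independently with probability t.
    masked = torch.rand(b, l, device=input_ids.device) < t
    noisy = torch.where(masked, torch.full_like(input_ids, mask_id), input_ids)
    logits = model(noisy).logits  # (b, l, vocab_size)
    # Cross-entropy on masked positions only, weighted by 1/t; this
    # weighted loss upper-bounds the model's negative log-likelihood.
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")
    return (ce * masked / t).sum() / (b * l)
```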

Quick Start & Requirements

  • Install: pip install transformers==4.38.2 gradio
  • Load Model: Use the transformers library to load GSAI-ML/LLaDA-8B-Base or GSAI-ML/LLaDA-8B-Instruct (see the loading sketch after this list).
  • Demo: Run python app.py after installing Gradio.
  • Dependencies: PyTorch, transformers, gradio.
  • Resources: The 8B model needs substantial VRAM; loading with torch_dtype=torch.bfloat16 keeps the weights to roughly 16 GB.
  • Docs: GUIDELINES.md, paper, demo
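
A minimal loading sketch (assuming, as is common for such checkpoints, that custom modeling code requires trust_remote_code=True; adjust device and dtype to your hardware):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "GSAI-ML/LLaDA-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    name, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda").eval()
```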

Highlighted Details

  • 8B parameter scale, trained from scratch, competitive with LLaMA3 8B.
  • Explores masked diffusion models as a principled, theoretically grounded approach to language modeling.
  • Supports conditional likelihood evaluation and conditional generation (a simplified sampler sketch follows this list).
  • Offers multi-round conversation via chat.py.
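
A simplified sketch of the iterative unmasking idea with low-confidence remasking; batch size 1, names, and the commit schedule are illustrative, and the repo's generate.py and chat.py are the reference implementations:

```python
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, gen_len, steps, mask_id):
    """Iteratively unmask a fixed-length response (batch size 1, sketch)."""
    mask = torch.full((1, gen_len), mask_id,
                      dtype=prompt_ids.dtype, device=prompt_ids.device)
    x = torch.cat([prompt_ids, mask], dim=1)
    n_prompt = prompt_ids.shape[1]
    for step in range(steps):
        logits = model(x).logits                 # full-context forward pass
        conf, pred = logits.softmax(-1).max(-1)  # per-token confidence
        resp = x[:, n_prompt:]                   # view into x
        still_masked = resp == mask_id
        # Only masked positions compete for commitment this step.
        scores = conf[:, n_prompt:].masked_fill(~still_masked, -float("inf"))
        # Commit enough tokens so ~gen_len * (step+1)/steps are unmasked.
        n_new = int(gen_len * (step + 1) / steps) - int((~still_masked).sum())
        if n_new > 0:
            idx = scores.topk(n_new, dim=-1).indices
            resp.scatter_(1, idx, pred[:, n_prompt:].gather(1, idx))
    return x[:, n_prompt:]
```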

Maintenance & Community

  • Official PyTorch implementation.
  • Credits "apolinário" for contributions to the Gradio demo.
  • Further details on training and development can be found in prior works: RADD and SMDM.

Licensing & Compatibility

  • The README does not explicitly state a license. The citation is for an arXiv preprint.

Limitations & Caveats

Sampling is currently slower than in autoregressive models: the generation length is fixed up front, there is no KV-cache, and the best quality requires as many sampling steps as there are response tokens, with each step re-running the full sequence through the Transformer. The authors plan efficiency optimizations in future work.
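
As rough, illustrative arithmetic (not a benchmark), the gap can be sized by counting how many token positions the network must compute:

```python
# Illustrative cost arithmetic only; real throughput depends on
# hardware, batching, and per-pass attention cost.
prompt_len, gen_len = 128, 512
steps = gen_len  # best quality reportedly needs steps == response length

# Autoregressive + KV-cache: each step computes 1 new token.
ar_token_passes = gen_len

# Masked diffusion, no KV-cache: each step re-encodes the full context.
diffusion_token_passes = steps * (prompt_len + gen_len)

print(diffusion_token_passes / ar_token_passes)  # 640.0
```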

Health Check
Last commit

4 days ago

Responsiveness

1 week

Pull Requests (30d)
2
Issues (30d)
2
Star History
1,131 stars in the last 90 days
