LLaDA by ML-GSAI

LLM research paper exploring masked diffusion language models

created 5 months ago
2,648 stars

Top 18.2% on sourcepulse

View on GitHub
Project Summary

LLaDA is an 8B-parameter, PyTorch-based masked diffusion language model designed to rival LLaMA3 8B. It targets researchers and developers exploring alternatives to autoregressive LLM architectures, offering a theoretically grounded approach to generative language modeling with capabilities such as in-context learning and instruction following.

How It Works

LLaDA employs a masked diffusion approach that departs from autoregressive models like GPT. It keeps a Transformer backbone but defines the model distribution through a forward masking process and a learned reverse (unmasking) process, training with a masking ratio sampled uniformly at random. Its training objective is a provable upper bound on the negative log-likelihood, which makes it a principled generative model, and the approach is Fisher consistent, which underpins its scalability in principle.
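
A minimal sketch of that objective, assuming `model` returns per-token logits and `mask_id` is the mask token's id (names are illustrative, not the repo's API):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, input_ids, mask_id):
    """One training step of the masked-diffusion objective (sketch)."""
    b, l = input_ids.shape
    # Sample a masking ratio t ~ Uniform(0, 1) per sequence.
    t = torch.rand(b, 1, device=input_ids.device)
    # Mask each token independently with probability t.
    masked = torch.rand(b, l, device=input_ids.device) < t
    noisy = torch.where(masked, torch.full_like(input_ids, mask_id), input_ids)
    logits = model(noisy).logits  # (b, l, vocab_size)
    # Cross-entropy on masked positions only, weighted by 1/t; this
    # weighted loss upper-bounds the model's negative log-likelihood.
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")
    return (ce * masked / t).sum() / (b * l)
```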

Quick Start & Requirements

  • Install: pip install transformers==4.38.2 gradio
  • Load Model: Use the transformers library to load GSAI-ML/LLaDA-8B-Base or GSAI-ML/LLaDA-8B-Instruct (see the loading sketch after this list).
  • Demo: Run python app.py after installing Gradio.
  • Dependencies: PyTorch, transformers, gradio.
  • Resources: The 8B model needs substantial VRAM; loading with torch_dtype=torch.bfloat16 keeps the weights to roughly 16 GB.
  • Docs: GUIDELINES.md, paper, demo
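
A minimal loading sketch (assuming, as is common for such checkpoints, that custom modeling code requires trust_remote_code=True; adjust device and dtype to your hardware):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "GSAI-ML/LLaDA-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    name, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda").eval()
```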

Highlighted Details

  • 8B parameter scale, trained from scratch, competitive with LLaMA3 8B.
  • Explores masked diffusion models as a principled, theoretically grounded approach to language modeling.
  • Supports conditional likelihood evaluation and conditional generation (a simplified sampler sketch follows this list).
  • Offers multi-round conversation via chat.py.
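
A simplified sketch of the iterative unmasking idea with low-confidence remasking; batch size 1, names, and the commit schedule are illustrative, and the repo's generate.py and chat.py are the reference implementations:

```python
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, gen_len, steps, mask_id):
    """Iteratively unmask a fixed-length response (batch size 1, sketch)."""
    mask = torch.full((1, gen_len), mask_id,
                      dtype=prompt_ids.dtype, device=prompt_ids.device)
    x = torch.cat([prompt_ids, mask], dim=1)
    n_prompt = prompt_ids.shape[1]
    for step in range(steps):
        logits = model(x).logits                 # full-context forward pass
        conf, pred = logits.softmax(-1).max(-1)  # per-token confidence
        resp = x[:, n_prompt:]                   # view into x
        still_masked = resp == mask_id
        # Only masked positions compete for commitment this step.
        scores = conf[:, n_prompt:].masked_fill(~still_masked, -float("inf"))
        # Commit enough tokens so ~gen_len * (step+1)/steps are unmasked.
        n_new = int(gen_len * (step + 1) / steps) - int((~still_masked).sum())
        if n_new > 0:
            idx = scores.topk(n_new, dim=-1).indices
            resp.scatter_(1, idx, pred[:, n_prompt:].gather(1, idx))
    return x[:, n_prompt:]
```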

Maintenance & Community

  • Official PyTorch implementation.
  • Credits "apolinário" for contributions to the Gradio demo.
  • Further details on training and development can be found in prior works: RADD and SMDM.

Licensing & Compatibility

  • The README does not explicitly state a license. The citation is for an arXiv preprint.

Limitations & Caveats

Sampling is currently slower than in autoregressive models: the generation length is fixed up front, there is no KV-cache, and the best quality requires as many sampling steps as there are response tokens, with each step re-running the full sequence through the Transformer. The authors plan efficiency optimizations in future work.
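
As rough, illustrative arithmetic (not a benchmark), the gap can be sized by counting how many token positions the network must compute:

```python
# Illustrative cost arithmetic only; real throughput depends on
# hardware, batching, and per-pass attention cost.
prompt_len, gen_len = 128, 512
steps = gen_len  # best quality reportedly needs steps == response length

# Autoregressive + KV-cache: each step computes 1 new token.
ar_token_passes = gen_len

# Masked diffusion, no KV-cache: each step re-encodes the full context.
diffusion_token_passes = steps * (prompt_len + gen_len)

print(diffusion_token_passes / ar_token_passes)  # 640.0
```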

Health Check
Last commit

4 days ago

Responsiveness

1 week

Pull Requests (30d)
2
Issues (30d)
2
Star History
1,131 stars in the last 90 days
