minimal-text-diffusion by madaan

Minimal implementation of diffusion models for text generation

created 2 years ago
388 stars

Top 75.0% on sourcepulse

Project Summary

This repository provides a minimal implementation of diffusion models for text generation, allowing users to train a model on a text corpus and generate new samples. It's suitable for researchers and practitioners interested in understanding or experimenting with diffusion models for sequential data.

How It Works

The project adapts diffusion model principles for text by treating text as a sequence of discrete tokens. It learns a diffusion process that gradually adds noise to token embeddings and a reverse process that denoises these embeddings to reconstruct the original text. The project's main contribution is simplifying existing image-based diffusion code, removing image-specific components, and focusing on text generation, making it more accessible for text-based applications.
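The forward noising step described above can be pictured numerically. The sketch below illustrates generic Gaussian diffusion on an embedding vector (a linear beta schedule, cumulative alpha products, and a sample from q(x_t | x_0)); it is an illustration of the general technique, not the repository's actual code, and all names are hypothetical.

```python
import math
import random

def noise_embedding(x0, t, alpha_bars):
    """Sample x_t ~ q(x_t | x_0): scale the clean embedding and add Gaussian noise."""
    a_bar = alpha_bars[t]
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * random.gauss(0.0, 1.0)
            for x in x0]

# Linear beta schedule; alpha_bar_t is the running product of alpha_t = 1 - beta_t.
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

clean = [0.5, -1.2, 0.3]                           # a toy token embedding
noisy = noise_embedding(clean, T - 1, alpha_bars)  # heavily noised at the last step
```

The reverse (denoising) process is learned: a model is trained to recover the clean embedding from `noisy` at each timestep, which is what the training script optimizes.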

Quick Start & Requirements

  • Install: pip install -r requirements.txt (conda recommended for mpi4py, pytorch, torchvision, torchaudio, cudatoolkit=11.3).
  • Prerequisites: CUDA 11.3 is recommended.
  • Dataset: Uses data/simple.txt by default; custom datasets require tokenization via python src/utils/custom_tokenizer.py train-word-level <your_data_path>.
  • Training: bash scripts/train.sh
  • Inference: bash scripts/text_sample.sh <path_to_checkpoint> <num_diffusion_steps> <num_samples>
  • Links: Diffusion in action example, Checkpoint
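The word-level tokenization step invoked by `custom_tokenizer.py train-word-level` can be pictured as building an id vocabulary from whitespace-split words. The following is a minimal sketch of that idea under assumed conventions (special tokens, vocabulary cap); the repository's actual tokenizer may differ.

```python
from collections import Counter

def train_word_level_tokenizer(lines, vocab_size=10_000):
    """Build a word -> id mapping over the most frequent whitespace-split words."""
    counts = Counter(word for line in lines for word in line.strip().split())
    vocab = {"[PAD]": 0, "[UNK]": 1}  # assumed special tokens
    for word, _ in counts.most_common(vocab_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(line, vocab):
    """Map a line to ids, falling back to [UNK] for out-of-vocabulary words."""
    return [vocab.get(w, vocab["[UNK]"]) for w in line.strip().split()]

vocab = train_word_level_tokenizer(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)  # "bird" maps to the [UNK] id
```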

Highlighted Details

  • Word-level tokenization is found to yield the most fluent outputs.
  • Initialization from scratch and fine-tuning embeddings are recommended for best results.
  • Supports classifier-guided diffusion for controllable generation.
  • The model is trained to predict the embedded input, with tied weights between the word embedding layer and the language model head.
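The tied-weights detail means one matrix does double duty: it maps token ids to embeddings on the way in, and scores denoised vectors against every vocabulary row on the way out. A pure-Python sketch of the concept (class and method names are hypothetical, not the repo's code):

```python
class TiedEmbedding:
    """One weight matrix serves as both the input embedding and the LM head."""

    def __init__(self, weights):
        self.weights = weights  # vocab_size x dim, shared in both directions

    def embed(self, token_id):
        """Input direction: token id -> embedding vector."""
        return self.weights[token_id]

    def logits(self, hidden):
        """Output direction: score a (denoised) vector against every row."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weights]

emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
vec = emb.embed(2)
scores = emb.logits(vec)  # token 2 scores highest against its own row
```

Tying keeps the embedding space and the output distribution consistent, which matters here because the model is trained to predict embedded inputs directly.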

Maintenance & Community

The project is a refactored version of Diffusion-LM and includes code from OpenAI's glide-text2im. The README provides no community channels or maintainer contact information.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

The "Gory details" section is marked with a TODO for cleanup and expansion. Classifier-guided sampling and further experiments are also listed as future work. The project's "minimal" nature suggests it may lack advanced features or robustness found in more comprehensive libraries.

Health Check
Last commit

2 years ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
27 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

mlx-gpt2 by pranavjad

Minimal GPT-2 implementation for educational purposes

Top 0.5% · 393 stars
created 1 year ago · updated 1 year ago