minimal-text-diffusion by madaan

Minimal implementation of diffusion models for text generation

created 2 years ago
388 stars

Top 75.0% on sourcepulse

Project Summary

This repository provides a minimal implementation of diffusion models for text generation, allowing users to train a model on a text corpus and generate new samples. It's suitable for researchers and practitioners interested in understanding or experimenting with diffusion models for sequential data.

How It Works

The project adapts diffusion model principles for text by treating text as a sequence of discrete tokens. It learns a diffusion process that gradually adds noise to token embeddings and a reverse process that denoises these embeddings to reconstruct the original text. The project's main contribution is simplifying existing image-based diffusion code, removing image-specific components, and focusing on text generation, making it more accessible for text-based applications.
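The forward noising step described above can be pictured numerically. The sketch below illustrates generic Gaussian diffusion on an embedding vector (a linear beta schedule, cumulative alpha products, and a sample from q(x_t | x_0)); it is an illustration of the general technique, not the repository's actual code, and all names are hypothetical.

```python
import math
import random

def noise_embedding(x0, t, alpha_bars):
    """Sample x_t ~ q(x_t | x_0): scale the clean embedding and add Gaussian noise."""
    a_bar = alpha_bars[t]
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * random.gauss(0.0, 1.0)
            for x in x0]

# Linear beta schedule; alpha_bar_t is the running product of alpha_t = 1 - beta_t.
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

clean = [0.5, -1.2, 0.3]                           # a toy token embedding
noisy = noise_embedding(clean, T - 1, alpha_bars)  # heavily noised at the last step
```

The reverse (denoising) process is learned: a model is trained to recover the clean embedding from `noisy` at each timestep, which is what the training script optimizes.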

Quick Start & Requirements

  • Install: pip install -r requirements.txt (conda recommended for mpi4py, pytorch, torchvision, torchaudio, cudatoolkit=11.3).
  • Prerequisites: CUDA 11.3 is recommended.
  • Dataset: Uses data/simple.txt by default; custom datasets require tokenization via python src/utils/custom_tokenizer.py train-word-level <your_data_path>.
  • Training: bash scripts/train.sh
  • Inference: bash scripts/text_sample.sh <path_to_checkpoint> <num_diffusion_steps> <num_samples>
  • Links: Diffusion in action example, Checkpoint
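The word-level tokenization step invoked by `custom_tokenizer.py train-word-level` can be pictured as building an id vocabulary from whitespace-split words. The following is a minimal sketch of that idea under assumed conventions (special tokens, vocabulary cap); the repository's actual tokenizer may differ.

```python
from collections import Counter

def train_word_level_tokenizer(lines, vocab_size=10_000):
    """Build a word -> id mapping over the most frequent whitespace-split words."""
    counts = Counter(word for line in lines for word in line.strip().split())
    vocab = {"[PAD]": 0, "[UNK]": 1}  # assumed special tokens
    for word, _ in counts.most_common(vocab_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(line, vocab):
    """Map a line to ids, falling back to [UNK] for out-of-vocabulary words."""
    return [vocab.get(w, vocab["[UNK]"]) for w in line.strip().split()]

vocab = train_word_level_tokenizer(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)  # "bird" maps to the [UNK] id
```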

Highlighted Details

  • Word-level tokenization is found to yield the most fluent outputs.
  • Initialization from scratch and fine-tuning embeddings are recommended for best results.
  • Supports classifier-guided diffusion for controllable generation.
  • The model is trained to predict the embedded input, with tied weights between the word embedding layer and the language model head.
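The tied-weights detail means one matrix does double duty: it maps token ids to embeddings on the way in, and scores denoised vectors against every vocabulary row on the way out. A pure-Python sketch of the concept (class and method names are hypothetical, not the repo's code):

```python
class TiedEmbedding:
    """One weight matrix serves as both the input embedding and the LM head."""

    def __init__(self, weights):
        self.weights = weights  # vocab_size x dim, shared in both directions

    def embed(self, token_id):
        """Input direction: token id -> embedding vector."""
        return self.weights[token_id]

    def logits(self, hidden):
        """Output direction: score a (denoised) vector against every row."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weights]

emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
vec = emb.embed(2)
scores = emb.logits(vec)  # token 2 scores highest against its own row
```

Tying keeps the embedding space and the output distribution consistent, which matters here because the model is trained to predict embedded inputs directly.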

Maintenance & Community

The project is a refactored version of Diffusion-LM and includes code from OpenAI's glide-text2im. The README provides no community channels or maintainer contact information.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

The "Gory details" section is marked with a TODO for cleanup and expansion. Classifier-guided sampling and further experiments are also listed as future work. The project's "minimal" nature suggests it may lack advanced features or robustness found in more comprehensive libraries.

Health Check
Last commit

2 years ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
27 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

mlx-gpt2 by pranavjad

Minimal GPT-2 implementation for educational purposes

Top 0.5% · 393 stars
created 1 year ago · updated 1 year ago