Minimal implementation of diffusion models for text generation
This repository provides a minimal implementation of diffusion models for text generation, allowing users to train a model on a text corpus and generate new samples. It's suitable for researchers and practitioners interested in understanding or experimenting with diffusion models for sequential data.
How It Works
The project adapts diffusion model principles for text by treating text as a sequence of discrete tokens. It learns a diffusion process that gradually adds noise to token embeddings and a reverse process that denoises these embeddings to reconstruct the original text. The core innovation lies in simplifying existing image-based diffusion code, removing image-specific components, and focusing on text generation, making it more accessible for text-based applications.
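The forward ("noising") process described above can be sketched in a few lines. This is a hedged, illustrative example, not the repository's actual code: the schedule values, dimensions, and the `q_sample` helper are assumptions chosen to mirror standard DDPM-style diffusion over continuous token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                              # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, T)    # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

# Toy "token embeddings": a sequence of 8 tokens, each a 16-dim vector.
x0 = rng.standard_normal((8, 16))

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0): shrink the clean embeddings toward zero
    and mix in Gaussian noise, with the ratio controlled by alpha_bar[t]."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.standard_normal(x0.shape)
x_early = q_sample(x0, 10, eps)    # early step: mostly signal
x_late = q_sample(x0, T - 1, eps)  # final step: almost pure noise

# At large t the embeddings are nearly destroyed; the reverse (denoising)
# network is trained to predict eps (or x0) from x_t, so that iterating the
# reverse process from pure noise reconstructs embeddings, which are then
# rounded back to discrete tokens.
```

In image diffusion the same math runs over pixel grids; the text adaptation simply applies it to the embedding matrix of a token sequence, with a final rounding step from continuous embeddings back to the vocabulary.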
Quick Start & Requirements
Install dependencies:

    pip install -r requirements.txt

(conda is recommended for mpi4py, pytorch, torchvision, torchaudio, and cudatoolkit=11.3.)

Training uses data/simple.txt by default; custom datasets require tokenization first:

    python src/utils/custom_tokenizer.py train-word-level <your_data_path>

To train:

    bash scripts/train.sh

To sample from a trained checkpoint:

    bash scripts/text_sample.sh <path_to_checkpoint> <num_diffusion_steps> <num_samples>
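Each of the <num_diffusion_steps> iterations performed by the sampling script is, conceptually, one reverse-diffusion update. The sketch below shows the standard DDPM posterior step under assumed schedule values, with a zero-returning stand-in for the trained denoiser network (a real run would call the model loaded from the checkpoint); none of these names come from the repository.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000                              # illustrative step count
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(x_t, t):
    # Stand-in for the trained denoiser; a real sampler would run the network.
    return np.zeros_like(x_t)

def p_sample(x_t, t):
    """One reverse step x_t -> x_{t-1}: subtract the predicted noise to form
    the DDPM posterior mean, then re-inject noise except at the final step."""
    eps_hat = eps_model(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Start from pure noise and denoise step by step, as the sampling script does.
x = rng.standard_normal((8, 16))
for t in reversed(range(T)):
    x = p_sample(x, t)
```

After the loop, the resulting embeddings would be rounded back to the nearest vocabulary tokens to produce text.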
Maintenance & Community
The project is a refactored version of Diffusion-LM and includes code from OpenAI's glide-text2im. No specific community channels or active maintainer information are provided in the README.
Limitations & Caveats
The "Gory details" section is marked with a TODO for cleanup and expansion. Classifier-guided sampling and further experiments are also listed as future work. The project's "minimal" nature suggests it may lack advanced features or robustness found in more comprehensive libraries.