PyTorch for text-to-image generation research
minDALL-E is a 1.3-billion-parameter text-to-image generation model with a PyTorch implementation aimed at researchers and developers. It offers a capable alternative to larger models by using a two-stage autoregressive approach trained on 14 million image-text pairs from Conceptual Captions.
How It Works
This model employs a two-stage autoregressive generation process. Stage 1 replaces the original DALL-E's discrete VAE with a VQGAN for efficient, high-quality sample generation; the VQGAN is fine-tuned on FFHQ and ImageNet. Stage 2 trains a 1.3B-parameter transformer from scratch on the image-text pairs. Generated candidates are then re-ranked with OpenAI's CLIP to improve relevance to the prompt (a standalone sketch of this step follows).
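The re-ranking step can be reproduced in isolation. The sketch below scores candidate images against a prompt with OpenAI's clip package and keeps the best matches; decoded_images is a placeholder for the Stage 1/Stage 2 output, and ViT-B/32 is an assumed CLIP variant, not necessarily the one the repository uses.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rerank(prompt, pil_images, top_n=4):
    """Return the top_n candidate images ranked by CLIP similarity to the prompt."""
    text = clip.tokenize([prompt]).to(device)
    batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(batch)
        text_features = model.encode_text(text)
        # Normalize, then take cosine similarity between the prompt and each image.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(-1)
    order = scores.argsort(descending=True)
    return [pil_images[i] for i in order[:top_n]]

# best = rerank("a painting of a fox at sunrise", decoded_images)
```

Sampling more candidates than needed and keeping only the top CLIP matches trades extra compute for better prompt fidelity.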
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
Interactive demo: examples/sampling_interactive_demo.ipynb
Example sampling script: examples/sampling_ex.py (condensed in the sketch below)
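End to end, the example script boils down to a few calls. The condensed sketch below follows the repository's sampling example; the checkpoint tag and argument names mirror that script but should be treated as illustrative rather than a stable API.

```python
import numpy as np
from dalle.models import Dalle

device = "cuda:0"

# Download and load the pretrained 1.3B checkpoint.
model = Dalle.from_pretrained("minDALL-E/1.3B")
model.to(device=device)

# The Stage 2 transformer samples image tokens from the prompt; the VQGAN
# decoder turns them into pixels. num_candidates > 1 leaves room for
# CLIP re-ranking afterwards.
prompt = "A painting of a monkey with sunglasses in the frame"
images = model.sampling(
    prompt=prompt,
    top_k=256,
    softmax_temperature=1.0,
    num_candidates=16,
    device=device,
).cpu().numpy()
images = np.transpose(images, (0, 2, 3, 1))  # NCHW -> NHWC for saving/plotting
```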
Highlighted Details
Maintenance & Community
Developed by Kakaobrain; contact contact@kakaobrain.com for collaboration or feedback. The repository's last update was roughly three years ago, and the project is inactive.
Licensing & Compatibility
The pretrained weights are distributed under the CC-BY-NC-SA 4.0 license, which rules out commercial use.
Limitations & Caveats
The model's relatively small training set (14M pairs) may leave it vulnerable to prompt engineering that elicits socially unacceptable content. As noted above, the non-commercial license on the pretrained weights may also restrict commercial applications.