minDALL-E by kakaobrain

PyTorch implementation of a 1.3B text-to-image generation model for research

created 3 years ago
635 stars

Top 53.2% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

minDALL-E is a PyTorch implementation of a 1.3-billion-parameter text-to-image generation model aimed at researchers and developers. It offers a capable alternative to larger models by using a two-stage autoregressive approach trained on 14 million image-text pairs from Conceptual Captions.

How It Works

The model uses a two-stage autoregressive generation process. Stage 1 replaces the original DALL-E's discrete VAE with a VQGAN for efficient, high-quality sample generation; the VQGAN is fine-tuned on FFHQ and ImageNet. Stage 2 uses a 1.3B-parameter transformer trained from scratch on the 14M image-text pairs. Generated candidates are then re-ranked with OpenAI's CLIP to improve relevance to the prompt.
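The sketch below illustrates the stage-2 sampling loop with toy stand-in modules; the grid size (16×16 codes), codebook size (16,384), and 64 text tokens are illustrative assumptions rather than the repository's exact configuration.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the two stages. In minDALL-E, stage 1 is a fine-tuned VQGAN
# that maps an image to a grid of discrete codes, and stage 2 is a 1.3B
# GPT-style transformer over text tokens followed by image tokens.
# The sizes below are assumptions for illustration.
CODEBOOK_SIZE = 16384        # VQGAN codebook entries
NUM_IMAGE_TOKENS = 16 * 16   # image tokens generated per sample
NUM_TEXT_TOKENS = 64         # text tokens in the prompt prefix

class ToyPrefixLM(torch.nn.Module):
    """Stand-in for the stage-2 transformer: predicts the next image token."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(CODEBOOK_SIZE, dim)
        self.head = torch.nn.Linear(dim, CODEBOOK_SIZE)

    def forward(self, tokens):                  # tokens: (B, T)
        h = self.embed(tokens).mean(dim=1)      # crude summary of the prefix
        return self.head(h)                     # (B, CODEBOOK_SIZE) logits

@torch.no_grad()
def sample_image_tokens(model, text_tokens, temperature=1.0, top_k=256):
    """Autoregressively sample image tokens conditioned on the text prefix."""
    tokens = text_tokens.clone()
    for _ in range(NUM_IMAGE_TOKENS):
        logits = model(tokens) / temperature
        top = torch.topk(logits, top_k, dim=-1)            # top-k filtering
        probs = F.softmax(top.values, dim=-1)
        pick = torch.multinomial(probs, num_samples=1)
        next_token = top.indices.gather(-1, pick)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, text_tokens.shape[1]:]                # image tokens only

model = ToyPrefixLM()
text = torch.randint(0, CODEBOOK_SIZE, (1, NUM_TEXT_TOKENS))  # pretend prompt tokens
codes = sample_image_tokens(model, text)                       # (1, 256) discrete codes
# Stage 1 (decode): a real VQGAN decoder would now map `codes` to an RGB image.
print(codes.shape)
```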

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: PyTorch == 1.8.0, CUDA >= 10.1. Requires ~5GB for model checkpoints.
  • Demo: Interactive demo available at examples/sampling_interactive_demo.ipynb.
  • Code: Full sampling example at examples/sampling_ex.py (a condensed sketch follows below).
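A condensed sampling sketch, loosely adapted from examples/sampling_ex.py; the Dalle.from_pretrained and model.sampling calls follow the repository's published example, but treat exact names, arguments, and defaults as assumptions that may differ across versions.

```python
import numpy as np
from dalle.models import Dalle

device = 'cuda:0'
model = Dalle.from_pretrained('minDALL-E/1.3B')   # downloads ~5GB of checkpoints
model.to(device=device)

# Generate a batch of candidate images for a single prompt.
images = model.sampling(prompt='A painting of a tree on the ocean',
                        top_k=256,
                        softmax_temperature=1.0,
                        num_candidates=16,
                        device=device).cpu().numpy()
images = np.transpose(images, (0, 2, 3, 1))        # NCHW -> NHWC for viewing
```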

Highlighted Details

  • Achieves competitive FID-50K scores (15.55 class-conditional, 37.58 unconditional) on ImageNet, outperforming VQGAN and ImageBART in some benchmarks.
  • Demonstrates strong transfer learning capabilities, fine-tuning effectively for class-conditional and unconditional generation.
  • Utilizes VQGAN in the first stage for improved sample quality compared to the original DALL-E's approach.
  • CLIP re-ranking is used to select the best candidate images from multiple generations (see the sketch after this list).
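The sketch below shows the general idea of CLIP re-ranking using OpenAI's clip package: score each candidate image against the prompt and keep the highest-scoring ones. The clip calls are from that package; the repository's own sampling example performs an equivalent step with its own helpers.

```python
import clip
import torch

device = 'cuda:0'
clip_model, preprocess = clip.load('ViT-B/32', device=device)

def rerank(prompt, pil_images, keep=4):
    """Score candidate PIL images against the prompt with CLIP; return the best."""
    text = clip.tokenize([prompt]).to(device)
    batch = torch.stack([preprocess(im) for im in pil_images]).to(device)
    with torch.no_grad():
        logits_per_image, _ = clip_model(batch, text)     # (N, 1) similarity logits
    order = logits_per_image.squeeze(1).argsort(descending=True)
    return [pil_images[i] for i in order[:keep]]
```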

Maintenance & Community

Developed by Kakao Brain. Contact contact@kakaobrain.com for collaboration or feedback.

Licensing & Compatibility

  • Source code licensed under Apache 2.0.
  • Pretrained weights licensed under CC-BY-NC-SA 4.0, which prohibits commercial use and requires derivative works to be shared under the same license.

Limitations & Caveats

Because the model was trained on a relatively small dataset (14M pairs), it may be susceptible to prompt engineering that produces socially unacceptable content. The CC-BY-NC-SA 4.0 license on the pretrained weights may restrict commercial applications.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 3 more.

guided-diffusion by openai

0.2%
7k
Image synthesis codebase for diffusion models
created 4 years ago
updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 4 more.

taming-transformers by CompVis

0.1%
6k
Image synthesis research paper using transformers
created 4 years ago
updated 1 year ago