minDALL-E by kakaobrain

PyTorch implementation of a 1.3B text-to-image generation model for research

created 3 years ago
635 stars

Top 53.2% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

minDALL-E is a PyTorch implementation of a 1.3-billion-parameter text-to-image generation model aimed at researchers and developers. It offers a capable alternative to larger models by using a two-stage autoregressive approach trained on 14 million image-text pairs from Conceptual Captions.

How It Works

The model uses a two-stage autoregressive generation process. Stage 1 replaces the original DALL-E's discrete VAE with a VQGAN for efficient, high-quality sample generation; the VQGAN is fine-tuned on FFHQ and ImageNet. Stage 2 uses a 1.3B-parameter transformer trained from scratch on the 14M image-text pairs. Generated candidates are then re-ranked with OpenAI's CLIP to improve relevance to the prompt.
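The sketch below illustrates the stage-2 sampling loop with toy stand-in modules; the grid size (16×16 codes), codebook size (16,384), and 64 text tokens are illustrative assumptions rather than the repository's exact configuration.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the two stages. In minDALL-E, stage 1 is a fine-tuned VQGAN
# that maps an image to a grid of discrete codes, and stage 2 is a 1.3B
# GPT-style transformer over text tokens followed by image tokens.
# The sizes below are assumptions for illustration.
CODEBOOK_SIZE = 16384        # VQGAN codebook entries
NUM_IMAGE_TOKENS = 16 * 16   # image tokens generated per sample
NUM_TEXT_TOKENS = 64         # text tokens in the prompt prefix

class ToyPrefixLM(torch.nn.Module):
    """Stand-in for the stage-2 transformer: predicts the next image token."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(CODEBOOK_SIZE, dim)
        self.head = torch.nn.Linear(dim, CODEBOOK_SIZE)

    def forward(self, tokens):                  # tokens: (B, T)
        h = self.embed(tokens).mean(dim=1)      # crude summary of the prefix
        return self.head(h)                     # (B, CODEBOOK_SIZE) logits

@torch.no_grad()
def sample_image_tokens(model, text_tokens, temperature=1.0, top_k=256):
    """Autoregressively sample image tokens conditioned on the text prefix."""
    tokens = text_tokens.clone()
    for _ in range(NUM_IMAGE_TOKENS):
        logits = model(tokens) / temperature
        top = torch.topk(logits, top_k, dim=-1)            # top-k filtering
        probs = F.softmax(top.values, dim=-1)
        pick = torch.multinomial(probs, num_samples=1)
        next_token = top.indices.gather(-1, pick)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, text_tokens.shape[1]:]                # image tokens only

model = ToyPrefixLM()
text = torch.randint(0, CODEBOOK_SIZE, (1, NUM_TEXT_TOKENS))  # pretend prompt tokens
codes = sample_image_tokens(model, text)                       # (1, 256) discrete codes
# Stage 1 (decode): a real VQGAN decoder would now map `codes` to an RGB image.
print(codes.shape)
```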

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: PyTorch == 1.8.0, CUDA >= 10.1. Requires ~5GB for model checkpoints.
  • Demo: Interactive demo available at examples/sampling_interactive_demo.ipynb.
  • Code: Full sampling example at examples/sampling_ex.py (a condensed sketch follows below).
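A condensed sampling sketch, loosely adapted from examples/sampling_ex.py; the Dalle.from_pretrained and model.sampling calls follow the repository's published example, but treat exact names, arguments, and defaults as assumptions that may differ across versions.

```python
import numpy as np
from dalle.models import Dalle

device = 'cuda:0'
model = Dalle.from_pretrained('minDALL-E/1.3B')   # downloads ~5GB of checkpoints
model.to(device=device)

# Generate a batch of candidate images for a single prompt.
images = model.sampling(prompt='A painting of a tree on the ocean',
                        top_k=256,
                        softmax_temperature=1.0,
                        num_candidates=16,
                        device=device).cpu().numpy()
images = np.transpose(images, (0, 2, 3, 1))        # NCHW -> NHWC for viewing
```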

Highlighted Details

  • Achieves competitive FID-50K scores (15.55 class-conditional, 37.58 unconditional) on ImageNet, outperforming VQGAN and ImageBART in some benchmarks.
  • Demonstrates strong transfer learning capabilities, fine-tuning effectively for class-conditional and unconditional generation.
  • Utilizes VQGAN in the first stage for improved sample quality compared to the original DALL-E's approach.
  • CLIP re-ranking is used to select the best candidate images from multiple generations (see the sketch after this list).
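The sketch below shows the general idea of CLIP re-ranking using OpenAI's clip package: score each candidate image against the prompt and keep the highest-scoring ones. The clip calls are from that package; the repository's own sampling example performs an equivalent step with its own helpers.

```python
import clip
import torch

device = 'cuda:0'
clip_model, preprocess = clip.load('ViT-B/32', device=device)

def rerank(prompt, pil_images, keep=4):
    """Score candidate PIL images against the prompt with CLIP; return the best."""
    text = clip.tokenize([prompt]).to(device)
    batch = torch.stack([preprocess(im) for im in pil_images]).to(device)
    with torch.no_grad():
        logits_per_image, _ = clip_model(batch, text)     # (N, 1) similarity logits
    order = logits_per_image.squeeze(1).argsort(descending=True)
    return [pil_images[i] for i in order[:keep]]
```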

Maintenance & Community

Developed by Kakao Brain. Contact contact@kakaobrain.com for collaboration or feedback.

Licensing & Compatibility

  • Source code licensed under Apache 2.0.
  • Pretrained weights licensed under CC-BY-NC-SA 4.0, which prohibits commercial use and requires derivative works to be shared under the same license.

Limitations & Caveats

Because the model was trained on a relatively small dataset (14M pairs), it may be susceptible to prompt engineering that produces socially unacceptable content. The CC-BY-NC-SA 4.0 license on the pretrained weights may restrict commercial applications.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 3 more.

guided-diffusion by openai

0.2%
7k
Image synthesis codebase for diffusion models
created 4 years ago
updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 4 more.

taming-transformers by CompVis

0.1%
6k
Image synthesis research paper using transformers
created 4 years ago
updated 1 year ago