karlo by kakaobrain

Text-to-image model based on unCLIP architecture

created 2 years ago
697 stars

Top 49.8% on sourcepulse

Project Summary

Karlo is a text-conditional image generation model based on OpenAI's unCLIP architecture. It aims to produce high-quality images from text prompts while recovering fine detail in fewer denoising steps. It is suitable for researchers and developers working with diffusion models.

How It Works

Karlo follows the unCLIP architecture and consists of three modules: a prior, a decoder, and a super-resolution (SR) module. Its SR stage is improved over the standard design and upscales images from 64px to 256px in only 7 reverse steps. An SR module trained with the DDPM objective performs the initial upscaling, and a second SR module fine-tuned with a VQ-GAN-style loss then recovers high-frequency details.
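
The two-module SR stage can be pictured as a short schedule in which each reverse step is handed to one of the two modules. The sketch below is illustrative only: the function and module names are hypothetical, and it assumes 1000 training timesteps, evenly strided sampling, and that only the final step goes to the fine-tuned module. The summary above states just the 7-step total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SRStep:
    index: int      # position in the reverse process, 0 = first step
    timestep: int   # training timestep visited at this step
    module: str     # which SR module handles the step (hypothetical names)


def plan_sr_schedule(num_steps=7, train_timesteps=1000, finetuned_steps=1):
    """Assign each reverse step to an SR module.

    Assumptions (not stated in the summary): 1000 training timesteps,
    evenly strided sampling, one final step for the fine-tuned module.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if not 0 <= finetuned_steps <= num_steps:
        raise ValueError("finetuned_steps must be between 0 and num_steps")
    stride = train_timesteps // num_steps
    # Reverse diffusion runs from the noisiest timestep down to 0.
    timesteps = [i * stride for i in range(num_steps)][::-1]
    plan = []
    for index, t in enumerate(timesteps):
        use_finetuned = index >= num_steps - finetuned_steps
        module = "vqgan_finetuned_sr" if use_finetuned else "ddpm_sr"
        plan.append(SRStep(index, t, module))
    return plan


for step in plan_sr_schedule():
    print(step)
```

Under these assumptions the first six steps go to the DDPM-trained module and the last one to the fine-tuned module.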

Quick Start & Requirements

  • Install: pip install diffusers transformers accelerate safetensors
  • Prerequisites: PyTorch >= 1.10, CUDA >= 11. A single V100 with 32GB VRAM is recommended for sampling.
  • Model Weights: Download required checkpoints via wget commands or setup.sh.
  • Demo: Launch a Gradio demo with python demo/product_demo.py.
  • Docs: Diffusers unCLIP Pipeline Docs
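
After installing the packages above, sampling through diffusers might look like the following sketch. It uses the `UnCLIPPipeline` class from diffusers; the checkpoint id `kakaobrain/karlo-v1-alpha`, the half-precision setting, and the 25 prior steps are assumptions not taken from this summary, while the 25 decoder steps and 7 super-resolution steps mirror the figures quoted elsewhere on this page. The helper names `sampling_kwargs` and `generate` are hypothetical.

```python
def sampling_kwargs(prior_steps=25, decoder_steps=25, super_res_steps=7):
    """Collect step counts under the keyword names UnCLIPPipeline accepts."""
    return {
        "prior_num_inference_steps": prior_steps,
        "decoder_num_inference_steps": decoder_steps,
        "super_res_num_inference_steps": super_res_steps,
    }


def generate(prompt, model_id="kakaobrain/karlo-v1-alpha", device="cuda"):
    """Generate one image; downloads the checkpoint on first use.

    model_id is an assumed Hugging Face Hub id, not taken from the summary.
    """
    # Imported lazily so sampling_kwargs() works without a GPU stack installed.
    import torch
    from diffusers import UnCLIPPipeline

    pipe = UnCLIPPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)
    return pipe(prompt, **sampling_kwargs()).images[0]
```

Calling `generate("a big red frog on a green leaf").save("frog.png")` on a CUDA machine would write one image to disk. Lowering the step counts trades quality for speed.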

Highlighted Details

  • Trained on 115M image-text pairs (COYO-100M, CC3M, CC12M).
  • Achieves a CLIP score of 0.3081 and an FID of 14.37 on the CC3M validation set with 25 decoder steps.
  • Uses ViT-L/14 from CLIP for prior and decoder, with a modified text encoder for efficiency.
  • Integrated into Hugging Face's diffusers library.
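
For readers unfamiliar with the CLIP score reported above, one common definition is the cosine similarity between CLIP image and text embeddings, averaged over an evaluation set. The summary does not say how Karlo's figure was computed, so the sketch below only illustrates that general definition. The toy 2-D vectors stand in for real CLIP embeddings, and both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_clip_score(image_embeddings, text_embeddings):
    """Average cosine similarity over matched image/text embedding pairs."""
    if not image_embeddings or len(image_embeddings) != len(text_embeddings):
        raise ValueError("need equally many, at least one, embedding pairs")
    scores = [
        cosine_similarity(img, txt)
        for img, txt in zip(image_embeddings, text_embeddings)
    ]
    return sum(scores) / len(scores)


# Toy data: the first pair is perfectly aligned, the second is orthogonal.
images = [[1.0, 0.0], [1.0, 0.0]]
texts = [[2.0, 0.0], [0.0, 3.0]]
print(mean_clip_score(images, texts))
```

In a real evaluation the vectors would come from a CLIP image encoder and text encoder applied to generated images and their prompts.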

Maintenance & Community

  • Released as Karlo-v1.0.alpha on 2022-12-01.
  • Available through diffusers and Hugging Face Spaces.
  • Contact: contact@kakaobrain.com for collaboration or feedback.

Licensing & Compatibility

  • License: CreativeML Open RAIL-M.
  • Commercial use is permitted, but adding a robust safety checker is recommended.

Limitations & Caveats

  • This is an alpha version.
  • The README notes that the second run in the Gradio demo can be unexpectedly slow, potentially taking up to 2 minutes, which it attributes to CUDA kernel launch times.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
