IF by deep-floyd

Text-to-image model for photorealistic synthesis and language understanding

created 2 years ago
7,840 stars

Top 6.8% on sourcepulse

Project Summary

DeepFloyd IF is a modular, cascaded diffusion model for high-fidelity text-to-image generation, targeting researchers and developers seeking state-of-the-art photorealism and language understanding. It offers a flexible architecture for various creative applications, including image generation, style transfer, super-resolution, and inpainting.

How It Works

IF employs a three-stage cascaded diffusion process. A frozen T5 text encoder produces text embeddings, which condition a UNet-based base model (IF-I) that generates 64x64 images. Two super-resolution diffusion models then upscale the result: IF-II to 256x256, and the Stable x4 upscaler to 1024x1024. The project attributes its photorealism and detail to this cascade, and in particular to the use of a larger UNet in the first stage.
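The resolution schedule described above can be written down as a small data structure. This is only an illustrative sketch: the `Stage` class and the helper functions are hypothetical names, not part of the `deepfloyd_if` package, and only the stage names and output sizes come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str      # stage label as used in the description above
    out_size: int  # side length of the square output, in pixels


# Stage names and output sizes are taken from the description above.
CASCADE = (
    Stage("IF-I (base)", 64),
    Stage("IF-II (super-resolution)", 256),
    Stage("Stable x4 upscaler", 1024),
)


def upscale_factors(stages):
    """Return the side-length multiplier between each pair of consecutive stages."""
    return [nxt.out_size // cur.out_size for cur, nxt in zip(stages, stages[1:])]


def describe(stages):
    """Return one readable line per stage, e.g. 'IF-I (base): 64x64'."""
    return [f"{s.name}: {s.out_size}x{s.out_size}" for s in stages]


if __name__ == "__main__":
    for line in describe(CASCADE):
        print(line)
    print("upscale factors:", upscale_factors(CASCADE))
```

Each super-resolution step multiplies the side length by four, so the pixel count grows by a factor of sixteen per stage.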

Quick Start & Requirements

  • Install: pip install deepfloyd_if==1.0.2rc0, then pip install xformers==0.0.16, then pip install git+https://github.com/openai/CLIP.git --no-deps (the --no-deps flag applies to the CLIP install only).
  • Prerequisites: a Hugging Face account and a login via huggingface_hub; accelerate, transformers, and safetensors. When running torch>=2.0.0, remove the enable_xformers_memory_efficient_attention() call.
  • VRAM: Minimum 16GB for IF-I-XL and IF-II-L; 24GB for all three stages (IF-I-XL, IF-II-L, Stable x4).
  • Docs: IF blog post, Diffusers integration
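The VRAM figures listed above can be turned into a quick pre-flight check. The sketch below is a hypothetical helper, not something shipped with the project; the 16 GB and 24 GB thresholds and the stage names are the only values taken from the list.

```python
from typing import List


def stages_that_fit(vram_gb: float) -> List[str]:
    """Pick the stage combination that matches the VRAM figures quoted above.

    Hypothetical helper for illustration; thresholds come from the list above.
    """
    if vram_gb >= 24:
        # Enough for the full cascade up to 1024x1024.
        return ["IF-I-XL", "IF-II-L", "Stable x4"]
    if vram_gb >= 16:
        # Enough for the base model plus the first upscaler (256x256 output).
        return ["IF-I-XL", "IF-II-L"]
    # Below the stated minimum.
    return []


if __name__ == "__main__":
    for gb in (12, 16, 24):
        print(gb, "GB ->", stages_that_fit(gb) or "below stated minimum")
```

An empty result only means the card is below the stated minimum for running the stages without memory-saving measures.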

Highlighted Details

  • Achieves a zero-shot FID score of 6.66 on COCO.
  • Supports text-to-image, style transfer, super-resolution, and inpainting.
  • Integrates with Hugging Face Diffusers for customizable pipelines and CPU offloading for lower VRAM usage.
  • Parameter-efficient fine-tuning is supported for adding new concepts.
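As a rough illustration of the Diffusers integration and CPU offloading mentioned above, the following sketch runs the first two stages. It assumes the `diffusers`, `torch`, `accelerate`, and `transformers` packages are installed, that the gated weights have been accepted and a Hugging Face login is active, and that the Hub model IDs below are the right ones; treat those IDs as assumptions and check the model cards. It is a sketch, not the project's reference script.

```python
# Hugging Face Hub model IDs; assumed here, verify against the model cards.
STAGE_1_ID = "DeepFloyd/IF-I-XL-v1.0"
STAGE_2_ID = "DeepFloyd/IF-II-L-v1.0"


def generate(prompt: str, seed: int = 0):
    """Run stages 1 and 2 and return a list of 256x256 PIL images."""
    # Imported lazily so that defining this function does not need the GPU stack.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import pt_to_pil

    stage_1 = DiffusionPipeline.from_pretrained(
        STAGE_1_ID, variant="fp16", torch_dtype=torch.float16
    )
    # Moves each submodel to the GPU only while it is in use.
    stage_1.enable_model_cpu_offload()

    # Stage 2 reuses the embeddings from stage 1, so its text encoder is skipped.
    stage_2 = DiffusionPipeline.from_pretrained(
        STAGE_2_ID, text_encoder=None, variant="fp16", torch_dtype=torch.float16
    )
    stage_2.enable_model_cpu_offload()

    generator = torch.manual_seed(seed)
    prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

    # Stage 1: text embeddings -> 64x64 image tensor.
    image = stage_1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        generator=generator,
        output_type="pt",
    ).images

    # Stage 2: 64x64 -> 256x256, conditioned on the same embeddings.
    image = stage_2(
        image=image,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        generator=generator,
        output_type="pt",
    ).images

    return pt_to_pil(image)
```

A call such as `generate("a photo of a red panda reading a book")[0].save("out.png")` would write the 256x256 result to disk. The third stage (the Stable x4 upscaler) is left out to keep the sketch short.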

Maintenance & Community

  • Developed by DeepFloyd Lab at StabilityAI.
  • Key contributors include Alex Shonenkov, Misha Konstantinov, Daria Bakshandaeva, Christoph Schuhmann, Ksenia Ivanova, and Nadiia Klokova.
  • Significant contributions from external community members like @Apolinário and @patrickvonplaten are acknowledged.

Licensing & Compatibility

  • The code is released under a "bespoke license". The model weights initially ship under a separate, restricted research-purposes-only license, with a fully open-source release planned.
  • Commercial use is not explicitly addressed for the initial release.

Limitations & Caveats

The initial release of IF model weights is under a restricted research-purposes-only license. The model has known limitations and biases, which are detailed in the model card.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 50 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 7 more.

stable-dreamfusion by ashawkey

0.1%
9k
Text-to-3D model using NeRF and diffusion
created 2 years ago
updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 12 more.

stablediffusion by Stability-AI

0.1%
41k
Latent diffusion model for high-resolution image synthesis
created 2 years ago
updated 1 month ago
Starred by Dan Abramov (Core Contributor to React), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 28 more.

stable-diffusion by CompVis

0.1%
71k
Latent text-to-image diffusion model
created 3 years ago
updated 1 year ago