e4t-diffusion by mkshing

Diffusion implementation for fast text-to-image model personalization

created 2 years ago
324 stars

Top 85.2% on sourcepulse

Project Summary

This repository provides an implementation of Encoder-based Domain Tuning (E4T) for fast personalization of text-to-image models, specifically targeting the Hugging Face diffusers library. It enables users to quickly adapt large pre-trained diffusion models to specific domains or styles with minimal training data and steps, benefiting researchers and artists looking to customize image generation.

How It Works

E4T pre-trains an encoder that maps a reference image to domain-specific embeddings, which are then injected into the diffusion model's conditioning. During domain tuning, only this encoder (and optionally the text encoder) is fine-tuned, drastically reducing training time and data requirements compared to methods like DreamBooth. The repository also leverages Stable unCLIP for data augmentation to enhance results.
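The core idea can be sketched in a few lines: an encoder predicts an offset that is added to a placeholder token embedding, and only that encoder receives gradients during tuning. This is an illustrative sketch only; the module name, feature dimensions, and scaling factor are assumptions, not the repository's actual API.

```python
# Minimal sketch of the encoder-based idea behind E4T (illustrative only;
# names and shapes here are assumptions, not the repository's real code).
import torch
import torch.nn as nn

class WordEmbeddingEncoder(nn.Module):
    """Maps a reference-image feature to an offset on a placeholder
    token embedding, so the diffusion model can be conditioned on it."""
    def __init__(self, feat_dim: int = 512, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, image_feat: torch.Tensor, base_embed: torch.Tensor) -> torch.Tensor:
        # The predicted offset is added to a fixed "word" embedding.
        # During domain tuning only this module (and optionally the
        # text encoder) is trainable; the diffusion U-Net stays frozen.
        return base_embed + 0.1 * self.proj(image_feat)

enc = WordEmbeddingEncoder()
feat = torch.randn(1, 512)   # e.g. a CLIP image feature of the reference image
base = torch.zeros(1, 768)   # embedding of the placeholder token
token_embed = enc(feat, base)
print(token_embed.shape)     # torch.Size([1, 768])
```

The resulting `token_embed` would stand in for a pseudo-word in the text prompt, which is what lets a handful of tuning steps suffice.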

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires Python, PyTorch, diffusers, accelerate, and xformers for memory-efficient attention.
  • Pre-trained models are available, with an example for face generation trained on FFHQ+CelebA-HQ.
  • Official documentation and model zoo links are not explicitly provided, but the README details pre-training, domain-tuning, and inference commands.
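The setup described above amounts to a clone-and-install sequence; the clone URL below assumes the usual GitHub pattern for the `mkshing` account and is not quoted from the README.

```shell
# Clone the repository (URL assumed from the repo/author names above)
git clone https://github.com/mkshing/e4t-diffusion.git
cd e4t-diffusion
# Install Python dependencies, including diffusers, accelerate, and xformers
pip install -r requirements.txt
```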

Highlighted Details

  • Achieves fast personalization with reportedly <15 training steps for domain tuning.
  • Supports pre-training on custom datasets (e.g., WikiArt) and domain-tuning with user-provided images.
  • Offers flexibility in choosing CLIP models and fine-tuning strategies (e.g., unfreezing CLIP vision).
  • Includes options for mixed precision (fp16) and memory-efficient attention (xformers).

Maintenance & Community

  • The project is associated with research published on arXiv (arXiv.org perpetual, non-exclusive license).
  • Stability AI provided resources for testing and training.
  • No explicit community channels (Discord/Slack) or roadmap are mentioned.

Licensing & Compatibility

  • The project's license is not explicitly stated in the README, but the associated arXiv paper has a "perpetual, non-exclusive license" from arXiv.org.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under development, with planned features such as using face segmentation networks for human-face domains and supporting ToMe (Token Merging) for more efficient training. The exact licensing for the codebase itself requires clarification.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago · updated 11 months ago