e4t-diffusion by mkshing

Diffusion implementation for fast text-to-image model personalization

created 2 years ago
324 stars

Top 85.2% on sourcepulse

Project Summary

This repository provides an implementation of Encoder-based Domain Tuning (E4T) for fast personalization of text-to-image models, specifically targeting the Hugging Face diffusers library. It enables users to quickly adapt large pre-trained diffusion models to specific domains or styles with minimal training data and steps, benefiting researchers and artists looking to customize image generation.

How It Works

E4T pre-trains an encoder that maps a reference image to domain-specific embeddings, which are then injected into the diffusion model's conditioning. During domain tuning, only this encoder (and optionally the text encoder) is fine-tuned, drastically reducing training time and data requirements compared to methods like DreamBooth. The repository also leverages Stable unCLIP for data augmentation to enhance results.
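The core idea can be sketched in a few lines: an encoder predicts an offset that is added to a placeholder token embedding, and only that encoder receives gradients during tuning. This is an illustrative sketch only; the module name, feature dimensions, and scaling factor are assumptions, not the repository's actual API.

```python
# Minimal sketch of the encoder-based idea behind E4T (illustrative only;
# names and shapes here are assumptions, not the repository's real code).
import torch
import torch.nn as nn

class WordEmbeddingEncoder(nn.Module):
    """Maps a reference-image feature to an offset on a placeholder
    token embedding, so the diffusion model can be conditioned on it."""
    def __init__(self, feat_dim: int = 512, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, image_feat: torch.Tensor, base_embed: torch.Tensor) -> torch.Tensor:
        # The predicted offset is added to a fixed "word" embedding.
        # During domain tuning only this module (and optionally the
        # text encoder) is trainable; the diffusion U-Net stays frozen.
        return base_embed + 0.1 * self.proj(image_feat)

enc = WordEmbeddingEncoder()
feat = torch.randn(1, 512)   # e.g. a CLIP image feature of the reference image
base = torch.zeros(1, 768)   # embedding of the placeholder token
token_embed = enc(feat, base)
print(token_embed.shape)     # torch.Size([1, 768])
```

The resulting `token_embed` would stand in for a pseudo-word in the text prompt, which is what lets a handful of tuning steps suffice.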

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires Python, PyTorch, diffusers, accelerate, and xformers for memory-efficient attention.
  • Pre-trained models are available, with an example for face generation trained on FFHQ+CelebA-HQ.
  • Official documentation and model zoo links are not explicitly provided, but the README details pre-training, domain-tuning, and inference commands.
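The setup described above amounts to a clone-and-install sequence; the clone URL below assumes the usual GitHub pattern for the `mkshing` account and is not quoted from the README.

```shell
# Clone the repository (URL assumed from the repo/author names above)
git clone https://github.com/mkshing/e4t-diffusion.git
cd e4t-diffusion
# Install Python dependencies, including diffusers, accelerate, and xformers
pip install -r requirements.txt
```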

Highlighted Details

  • Achieves fast personalization with reportedly <15 training steps for domain tuning.
  • Supports pre-training on custom datasets (e.g., WikiArt) and domain-tuning with user-provided images.
  • Offers flexibility in choosing CLIP models and fine-tuning strategies (e.g., unfreezing CLIP vision).
  • Includes options for mixed precision (fp16) and memory-efficient attention (xformers).

Maintenance & Community

  • The project is associated with research published on arXiv (arXiv.org perpetual, non-exclusive license).
  • Stability AI provided resources for testing and training.
  • No explicit community channels (Discord/Slack) or roadmap are mentioned.

Licensing & Compatibility

  • The project's license is not explicitly stated in the README, but the associated arXiv paper has a "perpetual, non-exclusive license" from arXiv.org.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under development, with planned features such as using face segmentation networks for human-face domains and supporting ToMe (Token Merging) for more efficient training. The exact licensing for the codebase itself requires clarification.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago · updated 11 months ago