Text-to-image fine-tuning research paper
Custom Diffusion enables fine-tuning text-to-image diffusion models like Stable Diffusion with a few images (~4-20) of a new concept. It targets researchers and developers looking to personalize generative AI models for specific objects, styles, or subjects, offering efficient customization with reduced storage overhead.
How It Works
The method fine-tunes only a subset of model parameters: the key and value projection matrices within the cross-attention layers. This selective fine-tuning significantly speeds up training (around 6 minutes on 2 A100 GPUs) and reduces the storage required for each new concept to approximately 75MB. It also supports combining multiple concepts and merging fine-tuned models.
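As a rough illustration of the idea, the sketch below freezes a Stable Diffusion UNet except for its cross-attention key/value projections. It assumes the Hugging Face diffusers model layout (cross-attention modules named attn2 with to_k/to_v projections) and is not the repository's own training script:

```python
# Sketch only: select Custom Diffusion's trainable subset in a diffusers UNet.
# Assumes diffusers' naming convention: "attn2" = cross-attention, and
# to_k / to_v = the key/value projection matrices.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

trainable_params = []
for name, param in unet.named_parameters():
    if "attn2.to_k" in name or "attn2.to_v" in name:
        param.requires_grad_(True)   # fine-tune cross-attention K/V only
        trainable_params.append(param)
    else:
        param.requires_grad_(False)  # everything else stays frozen

# Only the small K/V subset goes to the optimizer, which is what keeps
# per-concept checkpoints small.
optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)
print(f"trainable tensors: {len(trainable_params)}")
```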
Quick Start & Requirements
Clone the stable-diffusion repository, create and activate a conda environment (conda env create -f environment.yaml, conda activate ldm), and install dependencies (pip install clip-retrieval tqdm).
Download the Stable Diffusion v1.4 checkpoint (sd-v1-4.ckpt). Training is recommended on 2 A100 GPUs.
Highlighted Details
The method is also available through the Hugging Face diffusers library, including SDXL support.
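A hedged example of the diffusers route at inference time. It assumes weights saved by diffusers' Custom Diffusion training example; the directory path, weight file names, and the <new1> modifier token below are placeholders for whatever a training run actually produced:

```python
# Sketch: load Custom Diffusion weights with diffusers and generate an image.
# "path-to-model-dir", the weight file names, and the <new1> token are
# placeholders; adjust them to match your own training output.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Load the fine-tuned cross-attention K/V weights and the new-token embedding.
pipe.unet.load_attn_procs(
    "path-to-model-dir", weight_name="pytorch_custom_diffusion_weights.bin"
)
pipe.load_textual_inversion("path-to-model-dir", weight_name="<new1>.bin")

image = pipe(
    "<new1> cat sitting in a bucket",
    num_inference_steps=100,
    guidance_scale=6.0,
).images[0]
image.save("custom_cat.png")
```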
Maintenance & Community
The project is from Adobe Research and cites CVPR 2023. It is actively supported within the Hugging Face diffusers library.
Licensing & Compatibility
The repository itself does not explicitly state a license. However, it relies on the stable-diffusion repository, which is typically under a permissive license, and uses models from Hugging Face, which are also generally available for commercial use.
Limitations & Caveats
The original training scripts were developed against a specific commit of the stable-diffusion repository, which might require careful version management. Fine-tuning on human faces may require adjusted hyperparameters (lower learning rate, longer training).