Text-to-image model enhancing semantic alignment using LLMs
Top 32.7% on sourcepulse
ELLA enhances diffusion models by integrating Large Language Models (LLMs) for improved semantic alignment in text-to-image generation. It targets researchers and developers seeking more nuanced control and understanding of prompts, offering better adherence to complex descriptions and styles.
How It Works
ELLA equips diffusion models with LLMs to process and refine text prompts, leading to more accurate image generation. It leverages LLMs for "caption upsampling," expanding short prompts into detailed descriptions that capture color, shape, and spatial relationships. This approach aims to overcome limitations of standard text encoders by providing richer semantic conditioning to the diffusion model's UNet.
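For intuition, here is a minimal PyTorch sketch of that kind of connector: a small set of learnable query tokens cross-attends to frozen LLM token features, with the diffusion timestep injected so the conditioning can vary across denoising steps. The module layout, the dimensions (2048 for a T5-XL encoder, 768 for SD 1.5 cross-attention), and the query count are illustrative assumptions, not ELLA's actual implementation.

    import torch
    import torch.nn as nn

    class TimestepAwareConnector(nn.Module):
        """Sketch: map frozen LLM token features to a fixed set of conditioning
        tokens for the UNet's cross-attention, modulated by the diffusion timestep.
        Structure and sizes are illustrative guesses, not ELLA's exact design."""

        def __init__(self, llm_dim=2048, cond_dim=768, num_queries=64, num_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, cond_dim) * 0.02)
            self.time_embed = nn.Sequential(
                nn.Linear(1, cond_dim), nn.SiLU(), nn.Linear(cond_dim, cond_dim)
            )
            self.proj_in = nn.Linear(llm_dim, cond_dim)
            self.attn = nn.MultiheadAttention(cond_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(cond_dim)
            self.ff = nn.Sequential(
                nn.Linear(cond_dim, 4 * cond_dim), nn.GELU(), nn.Linear(4 * cond_dim, cond_dim)
            )

        def forward(self, llm_tokens, timestep):
            # llm_tokens: (batch, seq_len, llm_dim) from a frozen T5-style encoder
            # timestep:   (batch,) diffusion timestep, scaled here to roughly [0, 1]
            b = llm_tokens.shape[0]
            t = self.time_embed(timestep.float().view(b, 1) / 1000.0)         # (b, cond_dim)
            q = self.queries.unsqueeze(0).expand(b, -1, -1) + t.unsqueeze(1)  # timestep-shifted queries
            kv = self.proj_in(llm_tokens)                                      # project LLM features
            out, _ = self.attn(q, kv, kv)                                      # cross-attend to prompt tokens
            out = out + self.ff(self.norm(out))
            return out  # (b, num_queries, cond_dim)

    # Smoke test with random tensors standing in for T5 features.
    features = torch.randn(2, 77, 2048)
    timesteps = torch.randint(0, 1000, (2,))
    cond = TimestepAwareConnector()(features, timesteps)
    print(cond.shape)  # torch.Size([2, 64, 768])

Conceptually, the resulting tokens would stand in for the CLIP text embeddings normally passed to the UNet as its cross-attention conditioning.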
Quick Start & Requirements
- Install via pip (dependencies are not explicitly listed, but PyTorch and Hugging Face libraries are implied); a possible package set is sketched after this list.
- Download the ELLA checkpoint ella-sd1.5-tsc-t5xl.safetensors.
- Run inference: python3 inference.py test --save_folder ./assets/ella-inference-examples --ella_path /path/to/ella-sd1.5-tsc-t5xl.safetensors
- Launch the Gradio demo: GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=8082 python3 ./inference.py demo /path/to/ella-sd1.5-tsc-t5xl.safetensors
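The exact dependency list is not reproduced here; as an assumption, a stack along these lines should cover the inference and demo scripts (package names are guesses, not taken from the source):

    pip install torch diffusers transformers safetensors gradio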
Highlighted Details
The README recommends running the FlanT5 text encoder in fp16 mode for optimal results.
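As a hedged illustration of that recommendation, the text encoder could be loaded in fp16 with Transformers; the model id google/flan-t5-xl and the CUDA device are assumptions, not taken from the README:

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    # Illustrative only: load a FlanT5-XL encoder in fp16 per the fp16 recommendation.
    # The exact checkpoint ELLA uses is not stated here; a GPU is assumed.
    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
    encoder = T5EncoderModel.from_pretrained(
        "google/flan-t5-xl", torch_dtype=torch.float16
    ).eval().to("cuda")

    tokens = tokenizer("a red cube on a blue sphere", return_tensors="pt").to("cuda")
    with torch.no_grad():
        prompt_features = encoder(**tokens).last_hidden_state  # (1, seq_len, 2048)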
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
ELLA is at an early research stage and has not been comprehensively tested. The README notes potential style loss with community models that rely heavily on CLIP text embeddings and recommends specific inference configurations (fp16 for FlanT5).