ELLA by TencentQQGYLab

Text-to-image model enhancing semantic alignment using LLMs

created 1 year ago
1,228 stars

Top 32.7% on sourcepulse

Project Summary

ELLA enhances diffusion models by integrating Large Language Models (LLMs) for improved semantic alignment in text-to-image generation. It targets researchers and developers who need generated images to follow complex descriptions and styles more closely than standard prompt handling allows.

How It Works

ELLA conditions the diffusion model's UNet on text features from an LLM rather than relying only on a standard text encoder, with the aim of giving the UNet richer semantic conditioning and producing images that match the prompt more accurately. For short prompts it also leverages LLMs for "caption upsampling," expanding them into detailed descriptions that capture color, shape, and spatial relationships.
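Caption upsampling can be pictured as wrapping a short prompt in an instruction that asks an LLM to elaborate on it. The sketch below only builds that instruction text. The template wording and the function name `build_upsampling_instruction` are assumptions made for illustration, not code from the ELLA repository, and the call to an actual LLM is left out.

```python
# Hypothetical illustration of caption upsampling: the template text and
# the function name are assumptions, not code from the ELLA repository.

UPSAMPLE_TEMPLATE = (
    "Rewrite the following image caption as one detailed description. "
    "Keep every object from the original and add plausible details about "
    "color, shape, and spatial relationships.\n"
    "Caption: {caption}\n"
    "Detailed description:"
)


def build_upsampling_instruction(short_caption: str) -> str:
    """Return the instruction text that would be sent to an LLM."""
    caption = " ".join(short_caption.split())  # collapse stray whitespace
    if not caption:
        raise ValueError("caption must not be empty")
    return UPSAMPLE_TEMPLATE.format(caption=caption)


if __name__ == "__main__":
    print(build_upsampling_instruction("a red cube on a blue sphere"))
```

The LLM's reply to this instruction, rather than the original short caption, would then be passed to the image model as the prompt.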

Quick Start & Requirements

  • Install dependencies via pip. The README does not list them explicitly, but PyTorch and Hugging Face libraries are implied.
  • Download ELLA models from Hugging Face (e.g., ella-sd1.5-tsc-t5xl.safetensors).
  • Inference command: python3 inference.py test --save_folder ./assets/ella-inference-examples --ella_path /path/to/ella-sd1.5-tsc-t5xl.safetensors
  • Demo command: GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=8082 python3 ./inference.py demo /path/to/ella-sd1.5-tsc-t5xl.safetensors
  • ComfyUI plugin available: TencentQQGYLab/ComfyUI-ELLA.
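For scripted runs, the two commands above can be assembled as argument lists instead of being typed by hand. This is a minimal sketch: the subcommands, flags, and environment variables are the ones shown in the list, while the helper names `build_inference_command` and `build_demo_command` are hypothetical, and the checkpoint path is left as the same placeholder.

```python
import shlex
from typing import Dict, List, Tuple


def build_inference_command(
    ella_path: str,
    save_folder: str = "./assets/ella-inference-examples",
) -> List[str]:
    """Argument list for the batch inference command listed above."""
    return [
        "python3", "inference.py", "test",
        "--save_folder", save_folder,
        "--ella_path", ella_path,
    ]


def build_demo_command(
    ella_path: str, host: str = "0.0.0.0", port: int = 8082
) -> Tuple[Dict[str, str], List[str]]:
    """Extra environment variables and argument list for the Gradio demo."""
    env = {"GRADIO_SERVER_NAME": host, "GRADIO_SERVER_PORT": str(port)}
    # The demo subcommand takes the checkpoint path positionally.
    return env, ["python3", "./inference.py", "demo", ella_path]


if __name__ == "__main__":
    checkpoint = "/path/to/ella-sd1.5-tsc-t5xl.safetensors"  # placeholder
    print(shlex.join(build_inference_command(checkpoint)))
    demo_env, demo_cmd = build_demo_command(checkpoint)
    print(" ".join(f"{k}={v}" for k, v in demo_env.items()), shlex.join(demo_cmd))
```

To actually launch either command from the repository root, the list can be passed to `subprocess.run(cmd, check=True)`. For the demo, merge the extra variables into the current environment with `env={**os.environ, **demo_env}`.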

Highlighted Details

  • Improves prompt adherence through LLM-powered caption upsampling.
  • Supports "flexible token length" for better handling of short prompts.
  • Offers a method to integrate with CLIP-based community models for style preservation.
  • Recommends using FlanT5 in fp16 mode for optimal results.
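One way to read "flexible token length" is that a short prompt is encoded at its own length instead of being padded to a fixed maximum. The toy sketch below contrasts the two. It is an interpretation, not the repository's code: the whitespace split stands in for the real tokenizer, and the `<pad>` token and the fixed length of 128 are illustrative assumptions.

```python
# Toy illustration only: a whitespace split stands in for the real
# tokenizer, and PAD and the default max_length of 128 are assumptions.
from typing import List, Optional

PAD = "<pad>"


def encode(prompt: str, max_length: Optional[int] = 128) -> List[str]:
    """Tokenize a prompt, padding to max_length unless it is None."""
    tokens = prompt.split()
    if max_length is None:
        return tokens  # flexible length: keep only the real tokens
    tokens = tokens[:max_length]  # truncate prompts that are too long
    return tokens + [PAD] * (max_length - len(tokens))


fixed = encode("a corgi wearing sunglasses")
flexible = encode("a corgi wearing sunglasses", max_length=None)

print(len(fixed), fixed.count(PAD))  # 128 124
print(len(flexible))                 # 4
```

With fixed-length encoding, a four-word prompt is mostly padding. With flexible length, the sequence contains only the prompt's own tokens.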

Maintenance & Community

  • Most recent updates date from June 2024 and include EMMA (a multi-modal adapter).
  • ComfyUI plugins available from the authors and third parties.
  • Community suggestions are welcomed via GitHub issues.

Licensing & Compatibility

  • License not explicitly stated in the README.
  • Compatibility with community models is an ongoing research area.

Limitations & Caveats

ELLA is in early research stages with limited comprehensive testing. The README notes potential style loss with CLIP-reliant community models and recommends specific inference configurations (fp16 for FlanT5).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 40 stars in the last 90 days
