LLMGA by dvlab-research

Multimodal LLM for image generation/editing, leveraging LLMs for detailed prompts

created 1 year ago
396 stars

Top 74.0% on sourcepulse

Project Summary

LLMGA is a multimodal large language model-based generation assistant designed to help users create and edit images through conversational interactions. It targets researchers and developers interested in advanced image generation and editing capabilities, offering precise control over Stable Diffusion models via detailed language prompts.

How It Works

LLMGA leverages Large Language Models (LLMs) to generate detailed textual prompts that control Stable Diffusion (SD) models for image generation and editing tasks such as text-to-image, inpainting, and outpainting. Compared with methods that pass fixed-size embeddings to SD, this approach improves the LLM's context understanding, reduces prompt noise, and yields more detailed, interpretable results. A two-stage training scheme is employed: first, the MLLM is trained to generate prompts; second, SD is optimized to align with those prompts. A reference-based restoration network is also proposed to mitigate visual disparities between newly generated and preserved regions in inpainting and outpainting.
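A minimal sketch of this flow, assuming standard transformers/diffusers APIs rather than the project's own scripts; the checkpoint paths are placeholders, not the actual repository model IDs:

```python
# Sketch: an MLLM expands a short request into a detailed prompt, and that text
# prompt (not a fixed-size embedding) then conditions a Stable Diffusion model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import StableDiffusionXLPipeline

llm_id = "path/to/llmga-mllm"        # placeholder: trained MLLM checkpoint
sd_id = "path/to/llmga-sdxl-t2i"     # placeholder: aligned SDXL checkpoint

tokenizer = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(
    llm_id, torch_dtype=torch.float16, device_map="auto"
)

request = "a cozy cabin in a snowy forest at dusk"
inputs = tokenizer(
    f"Expand into a detailed image-generation prompt: {request}",
    return_tensors="pt",
).to(llm.device)
out = llm.generate(**inputs, max_new_tokens=128)
# Keep only the newly generated tokens as the detailed prompt.
detailed_prompt = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

pipe = StableDiffusionXLPipeline.from_pretrained(
    sd_id, torch_dtype=torch.float16
).to("cuda")
image = pipe(prompt=detailed_prompt).images[0]
image.save("result.png")
```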

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -n llmga python=3.9), then run pip install -e . in the repository root and pip install . in the llmga/diffusers directory. For training, additional packages are recommended: pip install -e ".[train]", pip install -r requirements.txt, plus flash-attn, datasets, albumentations, and ninja.
  • Prerequisites: Python 3.9, conda, PyTorch. Training requires 8x A100 GPUs (80GB). Inference supports multi-GPU setups and 4-bit/8-bit quantization (a loading sketch follows this list).
  • Models & Data: Requires downloading MLLM and SD models, as well as curated training datasets (COCO, GQA, OCR-VQA, TextVQA, VisualGenome, LLaVA datasets).
  • Links: models on Hugging Face and an online demo (LLMGA7b-SDXL-T2I).
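A hedged sketch of 4-bit quantized loading of the MLLM for lighter-weight inference, using the generic transformers/bitsandbytes path; the checkpoint path is a placeholder, and the project's own inference scripts may wrap this differently:

```python
# Sketch: load the language model in 4-bit to reduce GPU memory at inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/llmga-mllm"  # placeholder for a downloaded LLMGA checkpoint

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",  # spreads layers across the available GPUs
)
```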

Highlighted Details

  • Supports various LLM backbones including Vicuna, Mistral, Llama3, Qwen2, Phi3, and Gemma.
  • Offers multilingual support, with Chinese-enhanced models available.
  • Integrates with external plugins such as ControlNet for expanded functionality (see the sketch after this list).
  • Provides fine-tuned SD1.5 and SDXL models for T2I and inpainting tasks.
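A minimal sketch of pairing an LLMGA-style detailed prompt with a ControlNet plugin via diffusers; the Canny ControlNet shown is a public checkpoint, while the SD1.5 checkpoint path is a placeholder and not necessarily what the repository ships:

```python
# Sketch: condition generation on both a detailed text prompt and an edge map.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/llmga-sd15-t2i",  # placeholder: fine-tuned SD1.5 checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edge_map = load_image("canny_edges.png")  # precomputed Canny edge image
detailed_prompt = "..."                   # produced by the MLLM as sketched above
image = pipe(prompt=detailed_prompt, image=edge_map).images[0]
image.save("controlled.png")
```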

Maintenance & Community

The project is actively maintained, with July 2024 releases adding new models, datasets, and code updates. It is developed under the dvlab-research organization.

Licensing & Compatibility

The README does not explicitly state the license. However, it mentions that some base LLM models have commercial licenses, suggesting potential compatibility for commercial use depending on the chosen LLM.

Limitations & Caveats

Training is resource-intensive, requiring multiple high-end GPUs. While inference supports quantization, detailed performance benchmarks for various configurations are not provided. The project relies on external datasets that require manual download and organization.

Health Check
  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
