LLMGA by dvlab-research

Multimodal LLM for image generation/editing, leveraging LLMs for detailed prompts

created 1 year ago
396 stars

Top 74.0% on sourcepulse

Project Summary

LLMGA is a multimodal large language model-based generation assistant designed to help users create and edit images through conversational interactions. It targets researchers and developers interested in advanced image generation and editing capabilities, offering precise control over Stable Diffusion models via detailed language prompts.

How It Works

LLMGA leverages Large Language Models (LLMs) to generate detailed textual prompts that control Stable Diffusion (SD) models for image generation and editing tasks such as text-to-image, inpainting, and outpainting. Compared with methods that pass fixed-size embeddings to SD, this approach improves the LLM's context understanding, reduces prompt noise, and yields more detailed, interpretable results. A two-stage training scheme is employed: first, the MLLM is trained to generate prompts; second, SD is optimized to align with those prompts. A reference-based restoration network is also proposed to mitigate visual disparities between newly generated and preserved regions in inpainting and outpainting.
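A minimal sketch of this flow, assuming standard transformers/diffusers APIs rather than the project's own scripts; the checkpoint paths are placeholders, not the actual repository model IDs:

```python
# Sketch: an MLLM expands a short request into a detailed prompt, and that text
# prompt (not a fixed-size embedding) then conditions a Stable Diffusion model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import StableDiffusionXLPipeline

llm_id = "path/to/llmga-mllm"        # placeholder: trained MLLM checkpoint
sd_id = "path/to/llmga-sdxl-t2i"     # placeholder: aligned SDXL checkpoint

tokenizer = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(
    llm_id, torch_dtype=torch.float16, device_map="auto"
)

request = "a cozy cabin in a snowy forest at dusk"
inputs = tokenizer(
    f"Expand into a detailed image-generation prompt: {request}",
    return_tensors="pt",
).to(llm.device)
out = llm.generate(**inputs, max_new_tokens=128)
# Keep only the newly generated tokens as the detailed prompt.
detailed_prompt = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

pipe = StableDiffusionXLPipeline.from_pretrained(
    sd_id, torch_dtype=torch.float16
).to("cuda")
image = pipe(prompt=detailed_prompt).images[0]
image.save("result.png")
```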

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -n llmga python=3.9), then run pip install -e . in the repository root and pip install . in the llmga/diffusers directory. For training, additional packages are recommended: pip install -e ".[train]", pip install -r requirements.txt, plus flash-attn, datasets, albumentations, and ninja.
  • Prerequisites: Python 3.9, conda, PyTorch. Training requires 8x A100 GPUs (80GB). Inference supports multi-GPU setups and 4-bit/8-bit quantization (a loading sketch follows this list).
  • Models & Data: Requires downloading MLLM and SD models, as well as curated training datasets (COCO, GQA, OCR-VQA, TextVQA, VisualGenome, LLaVA datasets).
  • Links: models on Hugging Face and an online demo (LLMGA7b-SDXL-T2I).
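A hedged sketch of 4-bit quantized loading of the MLLM for lighter-weight inference, using the generic transformers/bitsandbytes path; the checkpoint path is a placeholder, and the project's own inference scripts may wrap this differently:

```python
# Sketch: load the language model in 4-bit to reduce GPU memory at inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/llmga-mllm"  # placeholder for a downloaded LLMGA checkpoint

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",  # spreads layers across the available GPUs
)
```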

Highlighted Details

  • Supports various LLM backbones including Vicuna, Mistral, Llama3, Qwen2, Phi3, and Gemma.
  • Offers multilingual support, with Chinese-enhanced models available.
  • Integrates with external plugins such as ControlNet for expanded functionality (see the sketch after this list).
  • Provides fine-tuned SD1.5 and SDXL models for T2I and inpainting tasks.
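A minimal sketch of pairing an LLMGA-style detailed prompt with a ControlNet plugin via diffusers; the Canny ControlNet shown is a public checkpoint, while the SD1.5 checkpoint path is a placeholder and not necessarily what the repository ships:

```python
# Sketch: condition generation on both a detailed text prompt and an edge map.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/llmga-sd15-t2i",  # placeholder: fine-tuned SD1.5 checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edge_map = load_image("canny_edges.png")  # precomputed Canny edge image
detailed_prompt = "..."                   # produced by the MLLM as sketched above
image = pipe(prompt=detailed_prompt, image=edge_map).images[0]
image.save("controlled.png")
```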

Maintenance & Community

The project is actively maintained, with July 2024 releases adding new models, datasets, and code updates. It is developed under the dvlab-research organization.

Licensing & Compatibility

The README does not explicitly state the license. However, it mentions that some base LLM models have commercial licenses, suggesting potential compatibility for commercial use depending on the chosen LLM.

Limitations & Caveats

Training is resource-intensive, requiring multiple high-end GPUs. While inference supports quantization, detailed performance benchmarks for various configurations are not provided. The project relies on external datasets that require manual download and organization.

Health Check
  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
