Multimodal LLM for image generation/editing, leveraging LLMs for detailed prompts
LLMGA is a multimodal large language model-based generation assistant designed to help users create and edit images through conversational interactions. It targets researchers and developers interested in advanced image generation and editing capabilities, offering precise control over Stable Diffusion models via detailed language prompts.
How It Works
LLMGA leverages Large Language Models (LLMs) to generate detailed textual prompts that control Stable Diffusion (SD) models for image generation and editing tasks like text-to-image, inpainting, and outpainting. This approach enhances LLM context understanding, reduces prompt noise, and improves image detail and interpretability compared to methods using fixed-size embeddings. A two-stage training scheme is employed: first, training the MLLM to generate prompts, and second, optimizing SD to align with these prompts. A reference-based restoration network is also proposed to mitigate visual disparities in inpainting/outpainting.
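To make the division of labor concrete, here is a minimal sketch built from off-the-shelf Hugging Face components. It is not LLMGA's own API: a generic instruction-tuned LLM stands in for the trained MLLM, and both model ids are illustrative placeholders.

```python
# Minimal sketch of the two-stage flow (not LLMGA's actual API).
# Stage 1: an LLM expands a terse request into a detailed prompt.
# Stage 2: the detailed prompt conditions Stable Diffusion.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Stage 1 (stand-in for the trained MLLM): expand the user request.
expander = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
request = "a cozy cabin in winter"
detailed_prompt = expander(
    f"Rewrite as a detailed image-generation prompt: {request}",
    max_new_tokens=96,
    return_full_text=False,
)[0]["generated_text"]

# Stage 2: condition a stock SD pipeline on the detailed textual prompt.
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
sd(detailed_prompt).images[0].save("cabin.png")
```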
Quick Start & Requirements
Create a conda environment (`conda create -n llmga python=3.9`), activate it, and from the repo root install dependencies with `pip install -e .`, then run `pip install .` inside the `llmga/diffusers` directory. For training, additional packages are recommended for full functionality: `pip install -e ".[train]"`, `pip install -r requirements.txt`, `pip install flash-attn`, `pip install datasets`, `pip install albumentations`, and `pip install ninja`.

Requires conda and PyTorch. Training requires 8x A100 GPUs (80GB); inference supports multiple GPUs and 4-bit/8-bit quantization.
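As a rough illustration of quantized inference, the snippet below follows the standard transformers + bitsandbytes pattern rather than a documented LLMGA entry point; the checkpoint path is a hypothetical placeholder for whichever released model you download.

```python
# Hedged sketch of 4-bit quantized loading via transformers/bitsandbytes;
# the checkpoint id below is hypothetical, not a published LLMGA model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model_id = "path/to/llmga-checkpoint"  # hypothetical: substitute a real release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",  # shards the model across available GPUs
)
```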
Maintenance & Community
The project is actively updated, with releases as recent as July 2024 adding new models, datasets, and code. It is maintained under the dvlab-research organization.
Licensing & Compatibility
The README does not explicitly state a license. It does note that some of the base LLMs carry commercial licenses, so suitability for commercial use may depend on the chosen LLM.
Limitations & Caveats
Training is resource-intensive, requiring multiple high-end GPUs. While inference supports quantization, detailed performance benchmarks for various configurations are not provided. The project relies on external datasets that require manual download and organization.