Image editing via multimodal LLMs (research paper)
Top 12.8% on sourcepulse
This repository provides the code for MGIE (Multimodal Large Language Model-Guided Image Editing), a method that leverages multimodal large language models (MLLMs) to enhance instruction-based image editing. It addresses the challenge of brief or ambiguous user instructions by enabling MLLMs to derive more expressive guidance, leading to more controllable and flexible image manipulation for researchers and practitioners in computer vision and natural language processing.
How It Works
MGIE integrates MLLMs with image editing models. The core idea is to use the MLLM to interpret user instructions and generate richer, visually-grounded editing commands. These derived instructions then guide an end-to-end trained image manipulation model, allowing for more precise and nuanced edits based on natural language prompts.
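As a rough sketch of that two-stage flow (not the repository's actual API; `EditRequest`, `derive_expressive_instruction`, and `apply_edit` are hypothetical stand-ins for the LLaVA-based instruction model and the diffusion-based editing model):

```python
# Conceptual sketch of the MGIE two-stage flow; hypothetical stand-ins,
# not the repository's entry points.
from dataclasses import dataclass


@dataclass
class EditRequest:
    image_path: str   # input image to edit
    instruction: str  # brief, possibly ambiguous user instruction


def derive_expressive_instruction(request: EditRequest) -> str:
    """Stage 1: the MLLM inspects the image and rewrites the terse
    instruction into explicit, visually grounded guidance."""
    # e.g. "make it healthy" -> "replace the fries with a salad,
    # keep the burger and the plate unchanged"
    return f"[expressive rewrite of] {request.instruction}"


def apply_edit(image_path: str, guidance: str) -> str:
    """Stage 2: the end-to-end trained editing model consumes the
    derived guidance and produces the edited image."""
    out_path = image_path.replace(".png", "_edited.png")
    # ... diffusion-based editing would happen here ...
    return out_path


if __name__ == "__main__":
    req = EditRequest("burger.png", "make it healthy")
    guidance = derive_expressive_instruction(req)
    print(apply_edit(req.image_path, guidance))
```

The point of the split is that stage 1 turns a terse prompt into guidance that is explicit about what should change and what should stay fixed, which is what makes stage 2 more controllable.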
Quick Start & Requirements
Setup uses a conda environment with Python 3.10. Key dependencies include PyTorch (cu113), transformers, diffusers, Gradio, DeepSpeed, and FlashAttention. The LLaVA codebase needs to be cloned and installed, with two packages pinned to specific versions (cython==0.29.36, pydantic==1.10). Pre-trained checkpoints must be downloaded into designated directories (_ckpt/LLaVA-7B-v1, _ckpt/mgie_7b).
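Given how version-sensitive the setup is, a small pre-flight check can catch a mismatched environment before launching anything. This is a minimal sketch assuming the checkpoint paths and version pins listed above; `preflight` is a hypothetical helper, not part of the repository:

```python
# Minimal environment pre-flight check; a hypothetical helper, not part
# of the MGIE codebase. Assumes the checkpoint layout and version pins
# described above.
from pathlib import Path
from importlib.metadata import version, PackageNotFoundError

REQUIRED_CKPTS = ["_ckpt/LLaVA-7B-v1", "_ckpt/mgie_7b"]
PINNED = {"Cython": "0.29.36", "pydantic": "1.10"}


def preflight() -> None:
    # Checkpoints must already be downloaded into the expected folders.
    for ckpt in REQUIRED_CKPTS:
        if not Path(ckpt).is_dir():
            raise FileNotFoundError(f"missing checkpoint directory: {ckpt}")
    # Loose prefix match on versions; the setup notes pin exact releases.
    for pkg, want in PINNED.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            raise RuntimeError(f"{pkg} is not installed (expected {want})")
        if not have.startswith(want):
            print(f"warning: {pkg}=={have}, setup notes expect {want}")


if __name__ == "__main__":
    preflight()
    print("checkpoints and pinned packages look consistent")
```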
Highlighted Details
- Built on the LLaVA codebase, with the instruction-guided editing model trained end-to-end.
- Ships pre-trained 7B checkpoints (_ckpt/LLaVA-7B-v1, _ckpt/mgie_7b).
- Weights are released under CC-BY-NC, so commercial use is not permitted.
Maintenance & Community
Last activity was about 1 year ago; the repository is currently inactive.
Licensing & Compatibility
The project's weights are licensed under CC-BY-NC, prohibiting commercial use.
Limitations & Caveats
The setup involves complex dependency management and specific checkpoint requirements, potentially increasing adoption friction.