Image editing via multimodal LLMs (research paper)
Top 12.8% on sourcepulse
This repository provides the code for MGIE (Multimodal Large Language Model-Guided Image Editing), a method that leverages multimodal large language models (MLLMs) to enhance instruction-based image editing. It addresses the challenge of brief or ambiguous user instructions by enabling MLLMs to derive more expressive guidance, leading to more controllable and flexible image manipulation for researchers and practitioners in computer vision and natural language processing.
How It Works
MGIE integrates MLLMs with image editing models. The core idea is to use the MLLM to interpret user instructions and generate richer, visually-grounded editing commands. These derived instructions then guide an end-to-end trained image manipulation model, allowing for more precise and nuanced edits based on natural language prompts.
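As a rough sketch of that two-stage flow (not the repository's actual API; `EditRequest`, `derive_expressive_instruction`, and `apply_edit` are hypothetical stand-ins for the LLaVA-based instruction model and the diffusion-based editing model):

```python
# Conceptual sketch of the MGIE two-stage flow; hypothetical stand-ins,
# not the repository's entry points.
from dataclasses import dataclass


@dataclass
class EditRequest:
    image_path: str   # input image to edit
    instruction: str  # brief, possibly ambiguous user instruction


def derive_expressive_instruction(request: EditRequest) -> str:
    """Stage 1: the MLLM inspects the image and rewrites the terse
    instruction into explicit, visually grounded guidance."""
    # e.g. "make it healthy" -> "replace the fries with a salad,
    # keep the burger and the plate unchanged"
    return f"[expressive rewrite of] {request.instruction}"


def apply_edit(image_path: str, guidance: str) -> str:
    """Stage 2: the end-to-end trained editing model consumes the
    derived guidance and produces the edited image."""
    out_path = image_path.replace(".png", "_edited.png")
    # ... diffusion-based editing would happen here ...
    return out_path


if __name__ == "__main__":
    req = EditRequest("burger.png", "make it healthy")
    guidance = derive_expressive_instruction(req)
    print(apply_edit(req.image_path, guidance))
```

The point of the split is that stage 1 turns a terse prompt into guidance that is explicit about what should change and what should stay fixed, which is what makes stage 2 more controllable.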
Quick Start & Requirements
Setup uses a conda environment with Python 3.10. Key dependencies include PyTorch (cu113), transformers, diffusers, Gradio, DeepSpeed, and FlashAttention. The LLaVA codebase needs to be cloned and installed, with two packages pinned to specific versions (cython==0.29.36, pydantic==1.10). Pre-trained checkpoints must be downloaded into designated directories (_ckpt/LLaVA-7B-v1, _ckpt/mgie_7b).
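Given how version-sensitive the setup is, a small pre-flight check can catch a mismatched environment before launching anything. This is a minimal sketch assuming the checkpoint paths and version pins listed above; `preflight` is a hypothetical helper, not part of the repository:

```python
# Minimal environment pre-flight check; a hypothetical helper, not part
# of the MGIE codebase. Assumes the checkpoint layout and version pins
# described above.
from pathlib import Path
from importlib.metadata import version, PackageNotFoundError

REQUIRED_CKPTS = ["_ckpt/LLaVA-7B-v1", "_ckpt/mgie_7b"]
PINNED = {"Cython": "0.29.36", "pydantic": "1.10"}


def preflight() -> None:
    # Checkpoints must already be downloaded into the expected folders.
    for ckpt in REQUIRED_CKPTS:
        if not Path(ckpt).is_dir():
            raise FileNotFoundError(f"missing checkpoint directory: {ckpt}")
    # Loose prefix match on versions; the setup notes pin exact releases.
    for pkg, want in PINNED.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            raise RuntimeError(f"{pkg} is not installed (expected {want})")
        if not have.startswith(want):
            print(f"warning: {pkg}=={have}, setup notes expect {want}")


if __name__ == "__main__":
    preflight()
    print("checkpoints and pinned packages look consistent")
```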
Highlighted Details
- Built on the LLaVA codebase, with the instruction-guided editing model trained end-to-end.
- Ships pre-trained 7B checkpoints (_ckpt/LLaVA-7B-v1, _ckpt/mgie_7b).
- Weights are released under CC-BY-NC, so commercial use is not permitted.
Maintenance & Community
Last activity was about 1 year ago; the repository is currently inactive.
Licensing & Compatibility
The project's weights are licensed under CC-BY-NC, prohibiting commercial use.
Limitations & Caveats
The setup involves complex dependency management and specific checkpoint requirements, potentially increasing adoption friction.