ml-mgie by Apple

Image editing via multimodal LLMs (research paper)

created 1 year ago
3,890 stars

Top 12.8% on sourcepulse

Project Summary

This repository provides the code for MGIE (Multimodal Large Language Model-Guided Image Editing), a method that leverages multimodal large language models (MLLMs) to enhance instruction-based image editing. It addresses the challenge of brief or ambiguous user instructions by enabling MLLMs to derive more expressive guidance, leading to more controllable and flexible image manipulation for researchers and practitioners in computer vision and natural language processing.

How It Works

MGIE integrates MLLMs with image editing models. The core idea is to use the MLLM to interpret user instructions and generate richer, visually-grounded editing commands. These derived instructions then guide an end-to-end trained image manipulation model, allowing for more precise and nuanced edits based on natural language prompts.
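The two-stage flow described above can be sketched conceptually. The function names below are hypothetical placeholders for illustration, not the repository's actual API:

```python
# Conceptual sketch of the MGIE pipeline (hypothetical names, not the repo's API).

def derive_expressive_instruction(brief_instruction: str, image) -> str:
    """Stand-in for the MLLM stage: expand a terse edit request into
    visually grounded guidance (objects, attributes, regions)."""
    # A real MLLM conditions on the image; this stub just annotates the prompt.
    return f"{brief_instruction} (expanded with visual details from the image)"

def edit_image(image, expressive_instruction: str) -> dict:
    """Stand-in for the diffusion-based editing model, which in MGIE is
    trained end-to-end jointly with the MLLM guidance."""
    return {"image": image, "applied": expressive_instruction}

def mgie_edit(image, brief_instruction: str) -> dict:
    # Stage 1: the MLLM turns the brief prompt into expressive guidance.
    guidance = derive_expressive_instruction(brief_instruction, image)
    # Stage 2: the editing model manipulates the image under that guidance.
    return edit_image(image, guidance)

result = mgie_edit("photo.png", "make the sky dramatic")
print(result["applied"])
```

The point of the design is that the editing model never sees the user's terse prompt directly; it only sees the richer instruction derived by the MLLM, which is what makes the edits more controllable.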

Quick Start & Requirements

  • Installation: Requires a conda environment with Python 3.10. Key dependencies include PyTorch (cu113), transformers, diffusers, Gradio, DeepSpeed, and FlashAttention. The LLaVA codebase must be cloned and installed.
  • Prerequisites: CUDA 11.3, Git LFS, FFmpeg, Ninja, GPUtil, and specific versions of Python packages (e.g., cython==0.29.36, pydantic==1.10).
  • Setup: Download pre-trained LLaVA-7B and MGIE checkpoints into specified directories (_ckpt/LLaVA-7B-v1, _ckpt/mgie_7b).
  • Resources: Training runs distributed across 8 GPUs.
  • Links: demo.ipynb for inference and demonstration.
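Put together, the setup steps above might look like the following. This is an illustrative sketch, not the repository's verbatim commands; the PyTorch install line and the LLaVA clone URL are assumptions, so check the project README for exact versions and sources:

```shell
# Illustrative environment setup (not verbatim from the README).
conda create -n mgie python=3.10 -y
conda activate mgie

# PyTorch built for CUDA 11.3, plus the listed dependencies.
# Exact pinned versions are in the repo's requirements.
pip install torch --extra-index-url https://download.pytorch.org/whl/cu113
pip install transformers diffusers gradio deepspeed flash-attn ninja gputil
pip install cython==0.29.36 pydantic==1.10

# LLaVA codebase, cloned and installed (URL assumed here).
git clone https://github.com/haotian-liu/LLaVA
pip install -e LLaVA

# Checkpoints go in the directories the code expects
# (download requires Git LFS; weights are not fetched by this script).
mkdir -p _ckpt/LLaVA-7B-v1 _ckpt/mgie_7b
```

After the checkpoints are in place, demo.ipynb can be opened for inference.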

Highlighted Details

  • Implements "Guiding Instruction-based Image Editing via Multimodal Large Language Models" (ICLR'24 Spotlight).
  • Leverages MLLMs to generate expressive instructions for image editing.
  • End-to-end training framework for joint visual imagination and manipulation.
  • Requires specific checkpoint placements for LLaVA and MGIE.

Maintenance & Community

  • Developed by Apple.
  • Built upon the LLaVA codebase.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • Weight differentials are licensed under CC-BY-NC.
  • Apple disclaims responsibility for third-party software (e.g., LLaMA) and their terms.
  • The CC-BY-NC license restricts commercial use.

Limitations & Caveats

The project's weights are licensed under CC-BY-NC, prohibiting commercial use. The setup involves complex dependency management and specific checkpoint requirements, potentially increasing adoption friction.

Health Check
Last commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
24 stars in the last 90 days
