SmartEdit by TencentARC

Research code for complex instruction-based image editing using multimodal LLMs

created 1 year ago
347 stars

Top 81.1% on sourcepulse

View on GitHub
Project Summary

SmartEdit addresses complex, instruction-based image editing by leveraging multimodal large language models (MLLMs). It enables users to perform intricate edits through natural language, targeting researchers and practitioners in computer vision and generative AI. The framework aims to provide more nuanced control over image manipulation than existing methods.

How It Works

SmartEdit employs a two-stage training process. Stage 1 aligns a vision encoder with a large language model (LLM) using a large-scale dataset (CC12M) to establish a foundational understanding of image-text relationships. Stage 2 fine-tunes this aligned model on specific image editing datasets (InstructPix2Pix, MagicBrush, RefCOCO, etc.) and a synthetic dataset, integrating a diffusion model for the actual image generation. This approach allows the LLM to interpret complex editing instructions and guide the diffusion process effectively.
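The two-stage flow described above can be illustrated with a minimal, library-free sketch. All class and method names here are hypothetical stand-ins (none come from the SmartEdit repository); the sketch only shows how stage-1 alignment gates stage-2 instruction-guided generation.

```python
# Hypothetical sketch of SmartEdit's two-stage pipeline.
# Names are illustrative only, not the repository's API.

class VisionEncoder:
    def encode(self, image):
        # Map an image to token embeddings the LLM can consume.
        return f"<img-tokens:{image}>"

class AlignedLLM:
    def __init__(self):
        self.aligned = False

    def align(self, image_tokens, caption):
        # Stage 1: image-text alignment on a large corpus (e.g. CC12M).
        self.aligned = True

    def interpret(self, image_tokens, instruction):
        # Stage 2: turn a complex instruction into guidance for diffusion.
        assert self.aligned, "run stage-1 alignment first"
        return {"tokens": image_tokens, "edit": instruction}

class DiffusionModel:
    def generate(self, guidance):
        # Stage 2: synthesize the edited image from the LLM's guidance.
        return f"edited({guidance['edit']})"

def edit_image(image, instruction):
    enc, llm, diff = VisionEncoder(), AlignedLLM(), DiffusionModel()
    llm.align(enc.encode(image), caption="a photo")          # stage 1
    guidance = llm.interpret(enc.encode(image), instruction)  # stage 2
    return diff.generate(guidance)

print(edit_image("cat.jpg", "make the cat wear a hat"))
```

The key structural point the sketch captures is that the LLM only produces usable editing guidance after the vision encoder and LLM have been aligned, mirroring why SmartEdit trains alignment before instruction fine-tuning.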

Quick Start & Requirements

  • Installation: install PyTorch with pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118, then run pip install -r requirements.txt, and finally install flash-attention.
  • Prerequisites: CUDA 11.8, PyTorch 2.1.0, Vicuna-1.1-7B/13B checkpoints, LLaVA-1.1-7B/13B checkpoints, InstructDiffusion checkpoint, and various datasets (CC12M, InstructPix2Pix, MagicBrush, RefCOCO, LISA, ReasonSeg).
  • Setup: Requires downloading multiple large model checkpoints and datasets, with training scripts provided for both 7B and 13B parameter models.
  • Links: Paper, Project Page
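Collected as a shell script, the installation bullet above looks like the following. The final flash-attn line is an assumption: the summary only says "installing flash-attention", so check the repository's requirements for the exact package pin before running it.

```shell
# PyTorch 2.1.0 built against CUDA 11.8 (as listed above)
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 \
    --index-url https://download.pytorch.org/whl/cu118

# Remaining Python dependencies from the repository
pip install -r requirements.txt

# flash-attention; exact package name/version is an assumption here
pip install flash-attn --no-build-isolation
```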

Highlighted Details

  • CVPR-2024 Highlight paper.
  • Supports both "understanding" and "reasoning" scenarios for image editing.
  • Introduces new vocabulary embeddings for enhanced image-text interaction within the MLLM.
  • Offers inference scripts for different resolutions (256x256, 384x384).

Maintenance & Community

The project is associated with Tencent ARC and has contributions from multiple authors listed in the CVPR paper. Contact emails are provided for inquiries.

Licensing & Compatibility

The repository does not explicitly state a license. The underlying models (Vicuna, LLaVA) have their own licenses, which may impose restrictions on commercial use or redistribution. Users must verify compatibility with their intended use cases.

Limitations & Caveats

The "add" functionality (e.g., "Add a smaller elephant") is listed as a future release item. Users who downloaded checkpoints before April 28, 2024, may need to re-download specific files. Preparing the LLaVA checkpoints requires applying LLaMA delta weights, since the original LLaMA weights cannot be redistributed directly under their license.

Health Check
  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star history: 24 stars in the last 90 days
