Research paper for complex instruction-based image editing using multimodal LLMs
SmartEdit addresses complex, instruction-based image editing by leveraging multimodal large language models (MLLMs). It lets users perform intricate edits through natural-language instructions and is aimed at researchers and practitioners in computer vision and generative AI. The framework aims to provide more nuanced control over image manipulation than existing methods.
How It Works
SmartEdit employs a two-stage training process. Stage 1 aligns a vision encoder with a large language model (LLM) using a large-scale dataset (CC12M) to establish a foundational understanding of image-text relationships. Stage 2 fine-tunes this aligned model on specific image editing datasets (InstructPix2Pix, MagicBrush, RefCOCO, etc.) and a synthetic dataset, integrating a diffusion model for the actual image generation. This approach allows the LLM to interpret complex editing instructions and guide the diffusion process effectively.
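The two stages described above can be written down as a small, ordered schedule. This is only an illustrative sketch in Python, not the project's training code: the `Stage` class, its field names, the stage names, and the `describe` helper are all hypothetical, and the `trains_with_diffusion` flag for Stage 1 is an assumption based on the description mentioning the diffusion model only in Stage 2.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage; the class and all field names are hypothetical."""
    name: str
    goal: str
    datasets: Tuple[str, ...]
    trains_with_diffusion: bool  # assumption drawn from the description only


# Dataset names follow the description. "synthetic" is a placeholder label
# for the synthetic editing dataset; the "etc." in the text is not expanded.
STAGES = (
    Stage(
        name="stage1_alignment",
        goal="align the vision encoder with the LLM",
        datasets=("CC12M",),
        trains_with_diffusion=False,
    ),
    Stage(
        name="stage2_editing",
        goal="fine-tune on editing data and guide the diffusion model",
        datasets=("InstructPix2Pix", "MagicBrush", "RefCOCO", "synthetic"),
        trains_with_diffusion=True,
    ),
)


def describe(stages: Tuple[Stage, ...]) -> List[str]:
    """Return one summary line per stage, in execution order."""
    lines = []
    for index, stage in enumerate(stages, start=1):
        diffusion = "with" if stage.trains_with_diffusion else "without"
        lines.append(
            f"{index}. {stage.name}: {stage.goal} "
            f"on {', '.join(stage.datasets)} ({diffusion} diffusion model)"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(describe(STAGES)))
```

Keeping the schedule as plain data makes the ordering explicit: Stage 2 starts from the model that Stage 1 produced, so the two entries cannot be swapped.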
Quick Start & Requirements
Install PyTorch (CUDA 11.8 build) with:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
Then install the remaining dependencies with:
pip install -r requirements.txt
Finally, install flash-attention.
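After installation, the pinned versions from the install command above can be checked against the environment. The following is a minimal sketch using only the Python standard library. The helper names (`base_version`, `installed_versions`, `find_mismatches`) are hypothetical and not part of the repository. Wheels from the cu118 index report versions with a local build tag such as `2.1.0+cu118`, so the tag is stripped before comparing.

```python
from importlib import metadata

# Pins copied from the install command in this section.
PINS = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def base_version(version: str) -> str:
    """Drop a local build tag such as '+cu118' before comparing."""
    return version.split("+", 1)[0]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pins=PINS):
    """Return {package: (expected, actual)} for every pin that is not met."""
    problems = {}
    for name, expected in pins.items():
        actual = installed.get(name)
        if actual is None or base_version(actual) != expected:
            problems[name] = (expected, actual)
    return problems


if __name__ == "__main__":
    problems = find_mismatches(installed_versions(PINS))
    for name, (expected, actual) in problems.items():
        print(f"{name}: expected {expected}, found {actual}")
    if not problems:
        print("All pinned packages match.")
```

The script only reads package metadata, so it runs without importing torch or requiring a GPU.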
Highlighted Details
Maintenance & Community
The project is associated with Tencent ARC and has contributions from multiple authors listed in the CVPR paper. Contact emails are provided for inquiries.
Licensing & Compatibility
The repository does not explicitly state a license. The underlying models (Vicuna, LLaVA) have their own licenses, which may impose restrictions on commercial use or redistribution. Users must verify compatibility with their intended use cases.
Limitations & Caveats
The "add" functionality (e.g., "Add a smaller elephant") is listed as a future release item and is not yet available. Users who downloaded checkpoints before April 28, 2024, may need to re-download specific files. Preparing the LLaVA checkpoints requires careful handling of the LLaMA delta weights, a consequence of the policy protecting the LLaMA weights.