instruct-pix2pix by timothybrooks

Image editing model for instruction-based image manipulation

created 2 years ago
6,752 stars

Top 7.7% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of InstructPix2Pix, a model for instruction-based image editing. It allows users to edit images by providing natural language instructions, offering a powerful tool for creative image manipulation and content generation.

How It Works

InstructPix2Pix is fine-tuned from a Stable Diffusion checkpoint. It leverages a large, generated dataset of image-instruction-output triplets. The model's core innovation lies in its ability to interpret and apply textual editing instructions to modify input images, balancing adherence to the instruction with preservation of the original image's structure.
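This balance is achieved with the paper's two-scale classifier-free guidance, which combines three noise predictions: unconditional, image-conditioned, and fully conditioned. A minimal sketch of just that combination step (pure Python with scalars standing in for noise tensors; the function and argument names are illustrative, not from the codebase):

```python
def dual_cfg(e_uncond, e_img, e_full, s_image, s_text):
    """Combine three noise predictions with two guidance scales:
      e_uncond: prediction with no conditioning
      e_img:    prediction conditioned on the input image only
      e_full:   prediction conditioned on image and instruction
    s_image (cfg-image) pushes the result toward the input image;
    s_text (cfg-text) pushes it toward the edit instruction.
    """
    return (e_uncond
            + s_image * (e_img - e_uncond)
            + s_text * (e_full - e_img))

# With both scales at 1.0 the combination reduces to the fully
# conditioned prediction:
assert dual_cfg(0.0, 1.0, 2.0, 1.0, 1.0) == 2.0
```

Raising `s_text` above 1.0 makes edits more aggressive; raising `s_image` keeps the output closer to the original photo.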

Quick Start & Requirements

  • Install dependencies and download checkpoints:
    conda env create -f environment.yaml
    conda activate ip2p
    bash scripts/download_checkpoints.sh
    
  • Edit an image:
    python edit_cli.py --input imgs/example.jpg --output imgs/output.jpg --edit "turn him into a cyborg"
    
  • Requires a GPU with >18GB VRAM for default settings.
  • Official Docs: https://github.com/timothybrooks/instruct-pix2pix
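The same CLI exposes the two guidance scales directly. An illustrative invocation (the `--steps`, `--cfg-text`, and `--cfg-image` flags follow the repository's edit_cli.py; the values shown are only a starting point, and running it requires the downloaded checkpoints and a large GPU):

```shell
# Higher --cfg-text pushes the result toward the instruction;
# higher --cfg-image preserves more of the input image.
python edit_cli.py \
  --input imgs/example.jpg \
  --output imgs/output.jpg \
  --edit "turn him into a cyborg" \
  --steps 100 \
  --cfg-text 7.5 \
  --cfg-image 1.5
```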

Highlighted Details

  • Trained on a dataset of 451,990 examples generated using GPT-3 and Stable Diffusion.
  • Offers fine-grained control over editing via the --cfg-text and --cfg-image flags (text and image classifier-free guidance scales, respectively).
  • Includes an interactive Gradio app for real-time editing.
  • Codebase is based on CompVis/stable_diffusion.

Maintenance & Community

  • Project initiated by Tim Brooks, Aleksander Holynski, and Alexei A. Efros from UC Berkeley.
  • Available on HuggingFace Spaces for browser-based demos and Replicate for API access.
  • Integrations available via imaginairy and Hugging Face diffusers.
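For the diffusers route, a minimal sketch is below. The pipeline class and the timbrooks/instruct-pix2pix Hub weights are real, but file paths and parameter values are illustrative; running it downloads the model weights and assumes a CUDA GPU:

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load the released InstructPix2Pix weights from the Hugging Face Hub.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("imgs/example.jpg").convert("RGB")

# guidance_scale corresponds to cfg-text, image_guidance_scale to cfg-image.
edited = pipe(
    "turn him into a cyborg",
    image=image,
    num_inference_steps=100,
    guidance_scale=7.5,
    image_guidance_scale=1.5,
).images[0]
edited.save("imgs/output.jpg")
```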

Licensing & Compatibility

  • The README does not explicitly state a license. The codebase derives from CompVis/stable-diffusion, whose model weights are distributed under the CreativeML Open RAIL-M license (which carries use-based restrictions). Compatibility for commercial use should be verified before deployment.

Limitations & Caveats

  • The default setup requires a high-end GPU (>18GB VRAM).
  • The quality of edits can be sensitive to parameter tuning (CFG scales, steps) and instruction phrasing.
  • Faces that are small in the input image may not be rendered well due to Stable Diffusion's autoencoder limitations.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

  • 129 stars in the last 90 days

Starred by Dan Abramov (core contributor to React), Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), and 28 more.

Explore Similar Projects

  • stable-diffusion by CompVis — latent text-to-image diffusion model. 71k stars, top 0.1%; created 3 years ago, updated 1 year ago.