Kiwi-Edit by showlab

Versatile video editing via natural language instructions and references

Created 2 months ago
268 stars

Top 95.6% on SourcePulse

View on GitHub

Summary

Kiwi-Edit is a unified, open-source framework for advanced video editing guided by natural language instructions and reference images. It targets researchers and power users seeking flexible video manipulation capabilities, enabling tasks like style transfer, object manipulation, and background replacement through intuitive text prompts.

How It Works

The framework leverages a Multi-modal Large Language Model (MLLM) encoder combined with a video Diffusion Transformer (DiT) architecture. This approach allows for sophisticated understanding of textual instructions and visual references, facilitating precise video modifications. Its advantage lies in seamlessly integrating both instruction-only and reference-guided editing paradigms within a single, versatile system.
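
As a rough illustration of this conditioning scheme, the sketch below shows how encoder-produced instruction/reference tokens can steer a DiT-style block via cross-attention. All dimensions, module names, and wiring here are assumptions for illustration only, not Kiwi-Edit's actual implementation, which is defined in the repository.

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """Toy DiT block: video tokens self-attend, then cross-attend to
    conditioning tokens (e.g., MLLM-encoded instruction + references)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, video_tokens, cond_tokens):
        # Self-attention over spatio-temporal video latents.
        h = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(h, h, h)[0]
        # Cross-attention injects the editing instruction and references.
        h = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(h, cond_tokens, cond_tokens)[0]
        # Position-wise feed-forward.
        return video_tokens + self.mlp(self.norm3(video_tokens))

# Toy shapes: 256 video latent tokens, 77 conditioning tokens.
video = torch.randn(1, 256, 512)
cond = torch.randn(1, 77, 512)
print(ConditionedDiTBlock()(video, cond).shape)  # torch.Size([1, 256, 512])
```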

Quick Start & Requirements

  • Primary Install: Requires Python 3.10 and CUDA 12.8. Installation involves setting up a Conda environment, installing pinned PyTorch (2.7) and Accelerate versions, and then installing the project's dependencies (pip install -e ., plus DeepSpeed, FlashAttention, transformers, huggingface-hub, and wandb). An alternative install_full_env.sh script is provided.
  • Diffusers Inference: A separate environment can be set up using conda create -n diffusers python=3.10 -y and installing diffusers, decord, einops, accelerate, transformers==4.57.0, opencv-python, av.
  • Prerequisites: Base weights (e.g., Wan-AI/Wan2.2-TI2V-5B) must be downloaded via the Hugging Face Hub; a minimal download sketch follows this list.
  • Quick Test: Demo execution commands are provided for both the full environment (bash demo.py ...) and the Diffusers setup (python diffusers_demo.py ...).
  • Links: 🌐 Project Page | 📑 Paper | 🤗 Models | 🤗 Datasets | 🤗 Demo
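
For the base-weight prerequisite, a minimal download sketch using the huggingface_hub Python API. The Wan-AI/Wan2.2-TI2V-5B repo id is the example named above; the local_dir path is an arbitrary choice, and Kiwi-Edit's scripts may expect weights elsewhere.

```python
from huggingface_hub import snapshot_download

# Fetch the base weights referenced in the prerequisites; the download
# resumes automatically if interrupted.
snapshot_download(
    repo_id="Wan-AI/Wan2.2-TI2V-5B",
    local_dir="checkpoints/Wan2.2-TI2V-5B",  # arbitrary local path
)
```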

Highlighted Details

  • Supports diverse editing tasks: style application, object addition/removal/replacement, and background modification via natural language.
  • Enables "Subject Reference" editing (e.g., adding accessories) and "Background Reference" editing (e.g., applying artistic styles).
  • Offers multiple pre-trained models fine-tuned for instruction-only, reference-only, or combined editing scenarios on Hugging Face.
  • Includes comprehensive training scripts and links to evaluation benchmark datasets like OpenVE-Bench and RefVIE-Bench.

Maintenance & Community

The project authors include Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. No specific community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The license type is not stated in the provided README, so compatibility with commercial use or closed-source integration requires further investigation. The project depends on standard deep learning libraries: PyTorch, Accelerate, and Hugging Face Transformers/Diffusers.

Limitations & Caveats

Strict environment requirements include Python 3.10, CUDA 12.8, and PyTorch 2.7. Installation involves multiple manual steps and large model-weight downloads. The absence of a specified license is a significant adoption blocker for many use cases. Gemini-based evaluation scripts require careful API key management; a common pattern is sketched below.
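
One common pattern is to read the key from the environment rather than hard-coding it. The variable name below is an assumption; check the repository's evaluation docs for the exact name its scripts expect.

```python
import os

# Read the Gemini key from the environment instead of committing it to source.
# GEMINI_API_KEY is an assumed name; Kiwi-Edit's eval scripts may use another.
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError(
        "Set GEMINI_API_KEY in your shell before running the evaluation scripts."
    )
```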

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 32 stars in the last 30 days
