Ditto by EzioBy

Scaling instruction-based video editing with synthetic data

Created 2 weeks ago


431 stars

Top 68.8% on SourcePulse

Project Summary

Ditto addresses the critical data scarcity challenge in instruction-based video editing by introducing a scalable pipeline for generating high-quality synthetic data. This framework enables the training of state-of-the-art models like Editto, offering researchers and practitioners a robust solution for advanced video manipulation. The primary benefit is enabling high-fidelity, instruction-driven video edits at scale, overcoming limitations of existing datasets and models.

How It Works

Ditto employs a novel data generation pipeline that synergizes the creative diversity of image editing tools with an in-context video generator. To manage the cost-quality trade-off, it utilizes an efficient, distilled model architecture enhanced by a temporal enhancer for improved coherence and reduced computational load. An intelligent agent drives the process, generating diverse instructions and ensuring rigorous quality control for scalable data production. The resulting Ditto-1M dataset comprises one million high-fidelity video editing examples.
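The pipeline described above can be sketched in miniature. This is an illustrative mock only: every function name, the scoring heuristic, and the threshold are assumptions, standing in for the real components (an image-editing tool for keyframes, an in-context video generator, a distilled temporal enhancer, and an agent handling instruction generation and quality control).

```python
# Hypothetical sketch of a Ditto-style synthetic data loop.
# All names and logic here are illustrative stand-ins, not the project's API.
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    source_video: str
    edited_video: str
    quality: float


def generate_instruction(seed: int) -> str:
    # Stand-in for the agent proposing a diverse edit instruction.
    styles = ["make it snow", "turn day into night", "add a rainbow"]
    return styles[seed % len(styles)]


def edit_keyframe(video: str, instruction: str) -> str:
    # Stand-in for an image-editing tool applying the edit to a reference frame.
    return f"{video}::keyframe[{instruction}]"


def propagate_edit(keyframe: str) -> str:
    # Stand-in for the in-context video generator propagating the keyframe
    # edit across the clip, with a temporal enhancer restoring coherence.
    return f"{keyframe}->full_clip"


def quality_score(edited: str, instruction: str) -> float:
    # Stand-in for the agent's quality control; real systems would use a
    # learned judge, not string matching.
    return 1.0 if instruction.split()[0] in edited else 0.0


def make_sample(video: str, seed: int, threshold: float = 0.5):
    instr = generate_instruction(seed)
    edited = propagate_edit(edit_keyframe(video, instr))
    score = quality_score(edited, instr)
    return Sample(instr, video, edited, score) if score >= threshold else None


# Only samples passing the quality gate enter the dataset.
dataset = [s for s in (make_sample("clip.mp4", i) for i in range(3)) if s]
```

At production scale this loop runs over millions of source clips, with the quality gate discarding failed generations before they reach the final Ditto-1M dataset.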

Quick Start & Requirements

Installation involves creating a Conda environment (python=3.10), activating it, and running "pip install -e .". Users must then download the base models (e.g., Wan-AI/Wan2.1-VACE-14B) and the Ditto-specific weights from Hugging Face or Google Drive. Inference runs through the infer.sh script or python inference/infer_ditto.py, which require input/output video paths, an editing prompt, a LoRA path, and a device ID. ComfyUI integration is also supported, but requires its own setup and specific custom nodes. Links to the paper, project page, model weights, and dataset are provided in the README.
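A minimal sketch of that setup and inference flow follows. Only the environment name, model, and script paths stated above come from the source; the exact command-line flag names for infer_ditto.py are assumptions and should be checked against the repository's README.

```shell
# Environment setup (as documented).
conda create -n ditto python=3.10 -y
conda activate ditto
pip install -e .

# Download base and Ditto-specific weights before inference,
# e.g. Wan-AI/Wan2.1-VACE-14B from Hugging Face (paths are illustrative).

# Option 1: the provided wrapper script.
bash infer.sh

# Option 2: direct invocation. Flag names below are assumptions;
# the README specifies input/output paths, a prompt, a LoRA path,
# and a device ID as required arguments.
python inference/infer_ditto.py \
    --input_video input.mp4 \
    --output_video output.mp4 \
    --prompt "make it snow" \
    --lora_path ./ckpt/ditto_lora.safetensors \
    --device 0
```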

Highlighted Details

  • Ditto-1M: A dataset of one million high-fidelity synthetic video editing examples.
  • Editto: Achieves state-of-the-art performance in instruction-based video editing.
  • Scalable Data Generation: A holistic framework designed for efficient, high-quality synthetic data creation.
  • Efficient Architecture: Combines distilled models with temporal enhancement for reduced overhead and improved coherence.

Maintenance & Community

The project is associated with academic researchers and leverages foundational models like Wan, VACE, and QwenVL. The codebase is based on DiffSynth-Studio. No specific community channels (e.g., Discord, Slack) or explicit roadmap details are provided in the README.

Licensing & Compatibility

The project is licensed under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License). This license restricts usage to academic research purposes and prohibits commercial use.

Limitations & Caveats

The code is explicitly provided for academic research purposes only, with a strict non-commercial use clause. Integration via ComfyUI may result in some quality degradation due to the use of quantized and distilled models. As the associated paper is a preprint (2025), the project may represent ongoing research.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 20
  • Star History: 438 stars in the last 16 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI) and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

  • LightX2V by ModelTC: Video generation inference framework for efficient synthesis. 741 stars. Created 7 months ago; updated 15 hours ago. Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.
  • FastVideo by hao-ai-lab: Framework for accelerated video generation. 3k stars. Created 1 year ago; updated 2 days ago.