Ditto by EzioBy

Scaling instruction-based video editing with synthetic data

Created 2 weeks ago


431 stars

Top 68.8% on SourcePulse

Project Summary

Ditto addresses the critical data scarcity challenge in instruction-based video editing by introducing a scalable pipeline for generating high-quality synthetic data. This framework enables the training of state-of-the-art models like Editto, offering researchers and practitioners a robust solution for advanced video manipulation. The primary benefit is enabling high-fidelity, instruction-driven video edits at scale, overcoming limitations of existing datasets and models.

How It Works

Ditto employs a novel data generation pipeline that synergizes the creative diversity of image editing tools with an in-context video generator. To manage the cost-quality trade-off, it utilizes an efficient, distilled model architecture enhanced by a temporal enhancer for improved coherence and reduced computational load. An intelligent agent drives the process, generating diverse instructions and ensuring rigorous quality control for scalable data production. The resulting Ditto-1M dataset comprises one million high-fidelity video editing examples.
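The pipeline described above can be sketched in miniature. This is an illustrative mock only: every function name, the scoring heuristic, and the threshold are assumptions, standing in for the real components (an image-editing tool for keyframes, an in-context video generator, a distilled temporal enhancer, and an agent handling instruction generation and quality control).

```python
# Hypothetical sketch of a Ditto-style synthetic data loop.
# All names and logic here are illustrative stand-ins, not the project's API.
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    source_video: str
    edited_video: str
    quality: float


def generate_instruction(seed: int) -> str:
    # Stand-in for the agent proposing a diverse edit instruction.
    styles = ["make it snow", "turn day into night", "add a rainbow"]
    return styles[seed % len(styles)]


def edit_keyframe(video: str, instruction: str) -> str:
    # Stand-in for an image-editing tool applying the edit to a reference frame.
    return f"{video}::keyframe[{instruction}]"


def propagate_edit(keyframe: str) -> str:
    # Stand-in for the in-context video generator propagating the keyframe
    # edit across the clip, with a temporal enhancer restoring coherence.
    return f"{keyframe}->full_clip"


def quality_score(edited: str, instruction: str) -> float:
    # Stand-in for the agent's quality control; real systems would use a
    # learned judge, not string matching.
    return 1.0 if instruction.split()[0] in edited else 0.0


def make_sample(video: str, seed: int, threshold: float = 0.5):
    instr = generate_instruction(seed)
    edited = propagate_edit(edit_keyframe(video, instr))
    score = quality_score(edited, instr)
    return Sample(instr, video, edited, score) if score >= threshold else None


# Only samples passing the quality gate enter the dataset.
dataset = [s for s in (make_sample("clip.mp4", i) for i in range(3)) if s]
```

At production scale this loop runs over millions of source clips, with the quality gate discarding failed generations before they reach the final Ditto-1M dataset.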

Quick Start & Requirements

Installation involves creating a Conda environment (python=3.10), activating it, and running "pip install -e .". Users must then download the base models (e.g., Wan-AI/Wan2.1-VACE-14B) and the Ditto-specific weights from Hugging Face or Google Drive. Inference runs through the infer.sh script or python inference/infer_ditto.py, which require input/output video paths, an editing prompt, a LoRA path, and a device ID. ComfyUI integration is also supported, but requires its own setup and specific custom nodes. Links to the paper, project page, model weights, and dataset are provided in the README.
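A minimal sketch of that setup and inference flow follows. Only the environment name, model, and script paths stated above come from the source; the exact command-line flag names for infer_ditto.py are assumptions and should be checked against the repository's README.

```shell
# Environment setup (as documented).
conda create -n ditto python=3.10 -y
conda activate ditto
pip install -e .

# Download base and Ditto-specific weights before inference,
# e.g. Wan-AI/Wan2.1-VACE-14B from Hugging Face (paths are illustrative).

# Option 1: the provided wrapper script.
bash infer.sh

# Option 2: direct invocation. Flag names below are assumptions;
# the README specifies input/output paths, a prompt, a LoRA path,
# and a device ID as required arguments.
python inference/infer_ditto.py \
    --input_video input.mp4 \
    --output_video output.mp4 \
    --prompt "make it snow" \
    --lora_path ./ckpt/ditto_lora.safetensors \
    --device 0
```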

Highlighted Details

  • Ditto-1M: A dataset of one million high-fidelity synthetic video editing examples.
  • Editto: Achieves state-of-the-art performance in instruction-based video editing.
  • Scalable Data Generation: A holistic framework designed for efficient, high-quality synthetic data creation.
  • Efficient Architecture: Combines distilled models with temporal enhancement for reduced overhead and improved coherence.

Maintenance & Community

The project is associated with academic researchers and leverages foundational models like Wan, VACE, and QwenVL. The codebase is based on DiffSynth-Studio. No specific community channels (e.g., Discord, Slack) or explicit roadmap details are provided in the README.

Licensing & Compatibility

The project is licensed under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License). This license restricts usage to academic research purposes and prohibits commercial use.

Limitations & Caveats

The code is explicitly provided for academic research purposes only, with a strict non-commercial use clause. Integration via ComfyUI may result in some quality degradation due to the use of quantized and distilled models. As the associated paper is a preprint (2025), the project may represent ongoing research.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 20
  • Star History: 438 stars in the last 16 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI) and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

  • LightX2V by ModelTC: Video generation inference framework for efficient synthesis. 741 stars. Created 7 months ago; updated 15 hours ago. Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.
  • FastVideo by hao-ai-lab: Framework for accelerated video generation. 3k stars. Created 1 year ago; updated 2 days ago.