ChronoEdit by nv-tlabs

Image editing via temporal reasoning and video generation

Created 2 months ago
344 stars

Top 80.4% on SourcePulse

Project Summary

ChronoEdit addresses the challenge of generating temporally consistent and physically plausible edits in images by reframing the task as a video generation problem. It leverages pretrained video diffusion models, using input and edited images as start and end frames. The project targets researchers and developers in AI-driven image and video manipulation, offering a novel approach to achieve more realistic and controllable image editing trajectories.

How It Works

ChronoEdit treats image editing as a short video generation task, utilizing the temporal consistency inherent in pretrained video models. A key innovation is the introduction of "reasoning tokens" during a temporal reasoning stage. These tokens enable the model to understand and enforce physical plausibility throughout the editing process, visualizing the trajectory from the initial state to the final edited image. This method allows for complex edits that maintain coherence over time.
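The framing above can be illustrated with a toy sketch. This is not ChronoEdit's actual model: the real system uses a pretrained video diffusion model with reasoning tokens, while here the "trajectory" is plain linear interpolation between hypothetical start and end frames, purely to show the idea of taking the last frame of a short video as the edit result.

```python
import numpy as np

def edit_as_trajectory(start_frame: np.ndarray,
                       end_frame: np.ndarray,
                       num_frames: int) -> list:
    """Toy stand-in for the video-generation framing: rather than jumping
    directly from the input image to the edited image, produce a short
    'video' of intermediate states whose final frame is the edit."""
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        # Linear interpolation is a placeholder for the learned,
        # physically plausible trajectory the diffusion model generates.
        frames.append((1.0 - t) * start_frame + t * end_frame)
    return frames

# Hypothetical 2x2 grayscale "images": the edit brightens one pixel.
start = np.zeros((2, 2))
end = np.array([[1.0, 0.0], [0.0, 0.0]])
trajectory = edit_as_trajectory(start, end, num_frames=5)
final_edit = trajectory[-1]  # last frame of the video = the edited image
```

In ChronoEdit proper, the intermediate frames are where the temporal reasoning stage operates, which is why the editing trajectory itself can be visualized.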

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (conda env create -f environment.yml -n chronoedit_mini), activate it (conda activate chronoedit_mini), and install dependencies (pip install torch==2.7.1 torchvision==0.22.1, pip install -r requirements_minimal.txt). Optional: pip install flash-attn==2.6.3 for faster inference.
  • Prerequisites: Linux OS, Python 3.10. Requires ~34GB GPU memory for inference, increasing to ~38GB with temporal reasoning enabled. Using the recommended prompt enhancer (Qwen/Qwen3-VL-30B-A3B-Instruct) can require up to 60GB peak memory.
  • Models: Download ChronoEdit-14B-Diffusers from HuggingFace.
  • Links: Project Page (implied by repo), ChronoEdit-14B Model, Live Demo.
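The installation bullet above can be collected into a single setup script. The pip/conda commands are taken verbatim from the Quick Start; the repository URL is assumed from the project and org names and may differ.

```shell
# Clone the repository (URL assumed from project/org names) and enter it
git clone https://github.com/nv-tlabs/ChronoEdit.git
cd ChronoEdit

# Create and activate the Conda environment
conda env create -f environment.yml -n chronoedit_mini
conda activate chronoedit_mini

# Install core dependencies
pip install torch==2.7.1 torchvision==0.22.1
pip install -r requirements_minimal.txt

# Optional: faster inference
pip install flash-attn==2.6.3
```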

Highlighted Details

  • Temporal reasoning stage for physically plausible edits and visualization of editing trajectories.
  • Optional prompt enhancer (e.g., Qwen/Qwen3-VL-30B-A3B-Instruct) for improved edit quality.
  • Support for LoRA finetuning via Diffsynth-Studio.
  • Release of full training infrastructure and codebase for distributed inference and large-scale fine-tuning.
  • Automated dataset generation script using vision-language models and Chain-of-Thought reasoning.

Maintenance & Community

The project acknowledges contributions from NVIDIA teams and specific researchers. No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README excerpt.

Licensing & Compatibility

The license type is not specified in the provided README content; verify the licensing terms in the repository before any commercial use or closed-source integration.

Limitations & Caveats

The system is restricted to Linux environments. Inference demands significant GPU memory (~34GB, rising to ~38GB with temporal reasoning and up to ~60GB peak with the recommended prompt enhancer). On the positive side, the project appears to be actively developed, with recent releases of models and demos.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 10
  • Star History: 341 stars in the last 30 days
