ChronoEdit by nv-tlabs

Image editing via temporal reasoning and video generation

Created 2 months ago
344 stars

Top 80.4% on SourcePulse

Project Summary

ChronoEdit addresses the challenge of generating temporally consistent and physically plausible edits in images by reframing the task as a video generation problem. It leverages pretrained video diffusion models, using input and edited images as start and end frames. The project targets researchers and developers in AI-driven image and video manipulation, offering a novel approach to achieve more realistic and controllable image editing trajectories.

How It Works

ChronoEdit treats image editing as a short video generation task, utilizing the temporal consistency inherent in pretrained video models. A key innovation is the introduction of "reasoning tokens" during a temporal reasoning stage. These tokens enable the model to understand and enforce physical plausibility throughout the editing process, visualizing the trajectory from the initial state to the final edited image. This method allows for complex edits that maintain coherence over time.
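The framing above can be illustrated with a toy sketch. This is not ChronoEdit's actual model: the real system uses a pretrained video diffusion model with reasoning tokens, while here the "trajectory" is plain linear interpolation between hypothetical start and end frames, purely to show the idea of taking the last frame of a short video as the edit result.

```python
import numpy as np

def edit_as_trajectory(start_frame: np.ndarray,
                       end_frame: np.ndarray,
                       num_frames: int) -> list:
    """Toy stand-in for the video-generation framing: rather than jumping
    directly from the input image to the edited image, produce a short
    'video' of intermediate states whose final frame is the edit."""
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        # Linear interpolation is a placeholder for the learned,
        # physically plausible trajectory the diffusion model generates.
        frames.append((1.0 - t) * start_frame + t * end_frame)
    return frames

# Hypothetical 2x2 grayscale "images": the edit brightens one pixel.
start = np.zeros((2, 2))
end = np.array([[1.0, 0.0], [0.0, 0.0]])
trajectory = edit_as_trajectory(start, end, num_frames=5)
final_edit = trajectory[-1]  # last frame of the video = the edited image
```

In ChronoEdit proper, the intermediate frames are where the temporal reasoning stage operates, which is why the editing trajectory itself can be visualized.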

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (conda env create -f environment.yml -n chronoedit_mini), activate it (conda activate chronoedit_mini), and install dependencies (pip install torch==2.7.1 torchvision==0.22.1, pip install -r requirements_minimal.txt). Optional: pip install flash-attn==2.6.3 for faster inference.
  • Prerequisites: Linux OS, Python 3.10. Requires ~34GB GPU memory for inference, increasing to ~38GB with temporal reasoning enabled. Using the recommended prompt enhancer (Qwen/Qwen3-VL-30B-A3B-Instruct) can require up to 60GB peak memory.
  • Models: Download ChronoEdit-14B-Diffusers from HuggingFace.
  • Links: Project Page (implied by repo), ChronoEdit-14B Model, Live Demo.
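The installation bullet above can be collected into a single setup script. The pip/conda commands are taken verbatim from the Quick Start; the repository URL is assumed from the project and org names and may differ.

```shell
# Clone the repository (URL assumed from project/org names) and enter it
git clone https://github.com/nv-tlabs/ChronoEdit.git
cd ChronoEdit

# Create and activate the Conda environment
conda env create -f environment.yml -n chronoedit_mini
conda activate chronoedit_mini

# Install core dependencies
pip install torch==2.7.1 torchvision==0.22.1
pip install -r requirements_minimal.txt

# Optional: faster inference
pip install flash-attn==2.6.3
```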

Highlighted Details

  • Temporal reasoning stage for physically plausible edits and visualization of editing trajectories.
  • Optional prompt enhancer (e.g., Qwen/Qwen3-VL-30B-A3B-Instruct) for improved edit quality.
  • Support for LoRA finetuning via Diffsynth-Studio.
  • Release of full training infrastructure and codebase for distributed inference and large-scale fine-tuning.
  • Automated dataset generation script using vision-language models and Chain-of-Thought reasoning.

Maintenance & Community

The project acknowledges contributions from NVIDIA teams and specific researchers. No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README excerpt.

Licensing & Compatibility

The license type is not specified in the provided README content; verify the licensing terms in the repository before any commercial use or closed-source integration.

Limitations & Caveats

The system is restricted to Linux environments. Inference demands significant GPU memory (~34GB, rising to ~38GB with temporal reasoning and up to ~60GB peak with the recommended prompt enhancer). On the positive side, the project appears to be actively developed, with recent releases of models and demos.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 10
  • Star History: 341 stars in the last 30 days
