void-model by Netflix

AI-powered video object and interaction deletion

Created 2 weeks ago

1,429 stars

Top 28.1% on SourcePulse

View on GitHub
Project Summary

VOID (Video Object and Interaction Deletion) addresses advanced video inpainting by removing objects and their induced physical interactions. It targets researchers and engineers in video processing, offering a novel method to simulate consequential physical effects, such as objects falling when a supporting element is deleted, for realistic video manipulation.

How It Works

Built on CogVideoX and fine-tuned for video inpainting, VOID uses a two-pass transformer architecture: the first pass performs base inpainting, and the second refines the output with warped noise for temporal consistency. Its core innovation is "quadmask" conditioning, which encodes object, overlap, affected, and background regions so the model can simulate the physical consequences of removal. The mask-generation pipeline combines SAM2 segmentation with Gemini VLM reasoning about object interactions.
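To make the quadmask idea concrete, the four regions could be collapsed into a single per-pixel label map along these lines. This is a minimal sketch: the label values, region priority, and function name are assumptions for illustration, not VOID's actual implementation.

```python
import numpy as np

# Hypothetical label values for the four quadmask regions; the real
# encoding used by VOID may differ (e.g. one-hot channels).
REGION_LABELS = {"background": 0, "affected": 1, "overlap": 2, "object": 3}

def make_quadmask(object_m, overlap_m, affected_m):
    """Collapse three binary region masks (H, W bool arrays) into one
    uint8 label map; everything not covered is background (0).

    Priority when masks intersect (object > overlap > affected) is an
    assumption chosen for this sketch.
    """
    quad = np.zeros(object_m.shape, dtype=np.uint8)   # background = 0
    quad[affected_m] = REGION_LABELS["affected"]      # regions physically affected by the removal
    quad[overlap_m] = REGION_LABELS["overlap"]        # where object and scene content overlap
    quad[object_m] = REGION_LABELS["object"]          # the object being deleted
    return quad
```

A conditioning map like this can be fed to the transformer alongside the video latents, letting the model treat "erase this object" and "re-simulate this affected region" as distinct tasks.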

Quick Start & Requirements

The fastest way to try VOID is via an included Jupyter notebook. For manual setup, installation requires pip install -r requirements.txt. Key prerequisites include a GPU with 40GB+ VRAM (e.g., A100), setting a GEMINI_API_KEY for mask generation, and separately installing SAM2. Pretrained CogVideoX and VOID models must be downloaded from HuggingFace. A Gradio demo is available at https://huggingface.co/spaces/sam-motamed/VOID.
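The manual setup above might look roughly like the following. This is a sketch only: `<your-key>` is a placeholder, and the exact model repository IDs and SAM2 install steps should be taken from the project README.

```shell
# Install Python dependencies (assumes a CUDA-capable environment
# with a 40GB+ VRAM GPU, e.g. an A100)
pip install -r requirements.txt

# The mask-generation pipeline requires a Gemini API key
export GEMINI_API_KEY=<your-key>

# SAM2 is not covered by requirements.txt and must be installed
# separately; see the SAM2 repository for instructions.

# Finally, download the pretrained CogVideoX and VOID weights from
# HuggingFace (exact repository IDs are listed in the project README).
```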

Highlighted Details

  • Two-pass inference (base inpainting + warped-noise refinement) for enhanced temporal consistency.
  • Interaction-aware quadmask conditioning simulates physical consequences.
  • Mask generation leverages SAM2 and Gemini VLM for scene understanding.
  • Handles complex physical interactions like falling objects.

Maintenance & Community

The project encourages community adoption and contributions, inviting PRs for demos and extensions. A Gradio demo is provided by a core contributor. Specific community channels or roadmaps are not detailed.

Licensing & Compatibility

The code's license is not explicitly stated. Because of licensing constraints on the underlying training datasets, only the data-generation code is released rather than the datasets themselves. Compatibility with commercial use or closed-source linking is not specified.

Limitations & Caveats

Inference demands a high-end GPU with at least 40GB VRAM. Training data generation is complex, requiring access requests for datasets like HUMOTO and dependencies such as Blender.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 9
  • Star History: 1,444 stars in the last 16 days
