Gen-L-Video by G-U-N

Video generation research codebase using temporal co-denoising

created 2 years ago
299 stars

Top 90.0% on sourcepulse

Project Summary

Gen-L-Video provides a universal methodology for extending existing short video diffusion models to generate and edit long videos with multi-text conditioning. It addresses the limitations of current models that are restricted to short clips and single text prompts, enabling applications requiring longer, semantically diverse video content without additional training.

How It Works

Gen-L-Video employs a temporal co-denoising approach to bridge short video generation capabilities to longer sequences: the long video is treated as a set of overlapping short clips that are denoised in parallel by an off-the-shelf short video diffusion model, and the per-clip predictions are merged on overlapping frames at every denoising step. This effectively turns existing short-clip models into an abstract long video generator and editor, enabling the generation and editing of videos with hundreds of frames and diverse semantic segments while maintaining content consistency, all without requiring further model training.
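
To make the idea concrete, the sketch below shows one co-denoising step over overlapping windows in PyTorch. The function and parameter names (`co_denoise_step`, `predict_noise`, `window`, `stride`) and the plain uniform averaging are illustrative assumptions, not the repository's actual API or exact merging rule.

```python
import torch

def co_denoise_step(latents, t, prompts, predict_noise, window=16, stride=8):
    """One denoising step over a long latent sequence of shape [F, C, H, W].

    Each overlapping window of `window` frames is denoised with its own text
    prompt by a short video diffusion model; the per-frame noise predictions
    are averaged where windows overlap, keeping clip boundaries consistent.
    Assumes the windows cover all frames, e.g. (F - window) % stride == 0.
    """
    num_frames = latents.shape[0]
    merged = torch.zeros_like(latents)
    counts = torch.zeros(num_frames, 1, 1, 1, device=latents.device)

    for idx, start in enumerate(range(0, num_frames - window + 1, stride)):
        clip = latents[start:start + window]
        prompt = prompts[min(idx, len(prompts) - 1)]  # one prompt per semantic segment
        noise_pred = predict_noise(clip, t, prompt)   # off-the-shelf short-clip model
        merged[start:start + window] += noise_pred
        counts[start:start + window] += 1

    return merged / counts.clamp(min=1)  # uniform average over overlapping windows
```

In practice the merge can weight each window's contribution per frame; the uniform average above is simply the most basic choice that keeps overlapping frames consistent.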

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (conda env create -f requirements.yml), activate it (conda activate glv), and install PyTorch built against CUDA 11.6. Install xFormers, Segment Anything (SAM), and Grounding DINO via pip or by cloning their respective repositories; a small sanity check for the resulting environment is sketched after this list.
  • Pretrained Weights: Download necessary pretrained models using bash scripts/download_pretrained_models.sh. Paths to these weights must be specified in configuration files.
  • Dependencies: Python 3.8+, PyTorch 1.13.1, torchvision 0.14.1, torchaudio 0.13.1, CUDA 11.6, git-lfs. Requires significant disk space for cloned repositories and downloaded weights.
  • Hardware: A single RTX 3090 is sufficient for most results.
  • Documentation: Gen-L-Video GitHub
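
As referenced in the installation item above, the snippet below is a small, hypothetical sanity check (not part of the Gen-L-Video repository) for confirming the pinned PyTorch/CUDA build and that the optional components import cleanly; the module names `xformers`, `segment_anything`, and `groundingdino` are taken from the upstream projects.

```python
# Hypothetical environment check for the setup above (not part of Gen-L-Video).
import importlib

import torch

# Pinned versions from the requirements: torch 1.13.1 built against CUDA 11.6.
print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"compiled for CUDA {torch.version.cuda}")  # expected: 11.6

# Optional components installed via pip or from their cloned repositories.
for module in ("xformers", "segment_anything", "groundingdino"):
    try:
        importlib.import_module(module)
        print(f"{module}: OK")
    except ImportError as exc:
        print(f"{module}: missing ({exc})")
```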

Highlighted Details

  • Enables multi-text conditioned long video generation and editing.
  • Achieves generation and editing of videos with hundreds of frames and diverse semantic segments.
  • Operates without requiring additional training on top of existing short video diffusion models.
  • Supports various control mechanisms including pose, depth, segmentation, and sketch.

Maintenance & Community

The codebase builds on numerous other open-source projects, including diffusers, Tune-A-Video, Stable-Diffusion, ControlNet, and GroundingDINO. The primary author is Fu-Yun Wang. Community interaction happens through GitHub issues and discussions.

Licensing & Compatibility

The README does not state an explicit license for Gen-L-Video itself, and the code heavily relies on models and codebases that carry their own licenses (e.g., Stable Diffusion, ControlNet). Compatibility for commercial use would therefore require careful review of all underlying component licenses.

Limitations & Caveats

The README mentions that Gen-L^2 is a better-performing alternative. The initial repository clone may be very large due to included GIFs. Some installation steps, particularly for Xformers and Grounding DINO, can be time-consuming and may require specific CUDA environment configurations.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
