VGen by ali-vilab

Video synthesis codebase for state-of-the-art generative models

created 1 year ago
3,124 stars

Top 15.7% on sourcepulse

View on GitHub
Project Summary

VGen is a comprehensive open-source codebase for video generation, offering implementations of state-of-the-art diffusion models for various synthesis tasks. It caters to researchers and developers in AI video generation, providing tools for training, inference, and customization with a focus on high-quality output and controllability.

How It Works

VGen leverages cascaded diffusion models and hierarchical spatio-temporal decoupling to achieve high-quality video synthesis. It supports text-to-video, image-to-video, and controllable generation driven by motion and subject customization. The codebase is designed for extensibility and includes components for managing experiments and integrating a range of diffusion-model architectures.
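
To make the cascade concrete, here is a minimal, illustrative sketch of the two-stage data flow: a base model turns a text embedding into a low-resolution clip, and a refinement stage upscales it. The function names, shapes, and modules below are placeholders, not VGen's actual API:

    import torch
    import torch.nn.functional as F

    def base_stage(prompt_embedding, num_frames=16, size=32):
        # Placeholder for the base text-to-video diffusion model: in VGen this
        # would denoise a latent conditioned on the prompt; here we just return
        # a random low-resolution clip of shape (frames, channels, H, W).
        return torch.randn(num_frames, 3, size, size)

    def refine_stage(low_res_clip, scale=8):
        # Placeholder for the super-resolution stage: upsample each frame.
        return F.interpolate(low_res_clip, scale_factor=scale,
                             mode="bilinear", align_corners=False)

    prompt_embedding = torch.randn(1, 768)   # stands in for a text-encoder output
    clip = refine_stage(base_stage(prompt_embedding))
    print(clip.shape)                        # torch.Size([16, 3, 256, 256])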

Quick Start & Requirements

  • Install:
      conda create -n vgen python=3.8
      conda activate vgen
      pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
      pip install -r requirements.txt
  • Prerequisites: Python 3.8, PyTorch 1.12.0 with CUDA 11.3, ffmpeg, libsm6, libxext6.
  • Setup: clone the repository, then run the install commands above (a quick environment sanity check in Python follows this list).
  • Docs: Modelscope T2V Technical Report, I2VGen-XL Paper.
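
After installation, the pinned versions from the prerequisites can be sanity-checked from Python (assumes the cu113 wheel listed above was installed):

    import torch

    # Verify the pinned PyTorch/CUDA combination from the prerequisites.
    assert torch.__version__.startswith("1.12"), torch.__version__
    assert torch.version.cuda == "11.3", torch.version.cuda
    assert torch.cuda.is_available(), "no CUDA device visible"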

Highlighted Details

  • Implements multiple advanced models: I2VGen-XL, VideoComposer, HiGen, InstructVideo, DreamVideo, VideoLCM, TF-T2V.
  • Supports customization via LoRA fine-tuning, subject learning, and motion learning.
  • Includes tools for metric calculation (CLIP-T, CLIP-I, DINO-I, Temporal Consistency); a CLIP-T sketch follows this list.
  • Offers Gradio demos for local testing and HuggingFace/ModelScope integration.
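
As an illustration of the first metric, the following is a hedged sketch of CLIP-T (mean text-frame CLIP similarity) using the Hugging Face transformers CLIP API; VGen's own metric script may differ in model choice and normalization:

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_t(prompt, frames):
        # frames: list of PIL.Image video frames. Returns the mean cosine
        # similarity between the prompt embedding and each frame embedding.
        inputs = processor(text=[prompt], images=frames,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return (img @ txt.T).mean().item()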

Maintenance & Community

  • Developed by Tongyi Lab of Alibaba Group.
  • Active development with frequent releases of new models and features (e.g., InstructVideo, DreamVideo, VideoLCM).
  • Links to relevant papers and technical reports are provided.

Licensing & Compatibility

  • License: models are trained on the WebVid-10M and LAION-400M datasets and are intended for RESEARCH/NON-COMMERCIAL USE ONLY.

Limitations & Caveats

  • The current models perform inadequately on anime images and images with black backgrounds due to limited training data.
  • Super-resolution models for TF-T2V only support 32-frame input; 16-frame videos need their frames duplicated (a minimal sketch of the workaround follows).
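
A minimal sketch of that frame-duplication workaround, assuming frames are stored in (frames, channels, H, W) layout:

    import torch

    frames_16 = torch.randn(16, 3, 256, 256)           # a 16-frame clip
    frames_32 = frames_16.repeat_interleave(2, dim=0)  # duplicate along time
    print(frames_32.shape)                             # torch.Size([32, 3, 256, 256])
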
Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 34 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (author of SGLang), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 1 more.

Open-Sora-Plan by PKU-YuanGroup

  • Top 0.1% on sourcepulse, 12k stars
  • Open-source project aiming to reproduce a Sora-like T2V model
  • Created 1 year ago, updated 2 weeks ago