Pusa-VidGen by Yaofang-Liu

Video diffusion model with vectorized timestep adaptation

Created 7 months ago
660 stars

Top 50.8% on SourcePulse

Project Summary

Pusa-VidGen introduces a novel vectorized timestep adaptation (VTA) technique for video diffusion models, enabling fine-grained temporal control and multi-task capabilities at a small fraction of the usual training cost. Aimed at researchers and developers in AI video generation, it reports large reductions in training cost and dataset size compared to existing state-of-the-art models.

How It Works

Pusa employs frame-level noise control via vectorized timesteps, a departure from traditional scalar timestep methods. This approach, detailed in the FVDM paper, allows for non-destructive adaptation of base models like Wan-Video and Mochi, preserving their original capabilities while enabling new functionalities such as image-to-video, start-end frame generation, video extension, and transitions without task-specific training.
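As a rough illustration only (the class and function names below are placeholders, not Pusa's actual API), the difference between a scalar timestep and a vectorized one can be sketched in a few lines of PyTorch: giving each frame its own timestep lets conditioning frames be pinned at zero noise while the remaining frames are denoised.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a video diffusion backbone that accepts one
    # timestep per frame instead of a single scalar shared by all frames.
    class ToyFrameDenoiser(nn.Module):
        def __init__(self, channels=4):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, latents, t_per_frame):
            # latents: (frames, C, H, W); t_per_frame: (frames,)
            # Each frame's prediction is conditioned on its own (normalized) timestep.
            scale = (t_per_frame.float() / 1000.0).view(-1, 1, 1, 1)
            return self.conv(latents) * scale

    frames, channels, h, w = 16, 4, 32, 32
    latents = torch.randn(frames, channels, h, w)

    # Traditional scalar timestep: every frame sits at the same noise level.
    t_scalar = torch.full((frames,), 999)

    # Vectorized timesteps: pin conditioning frames at t=0 (clean) while the
    # rest stay noisy, giving frame-level control without task-specific training.
    t_vector = t_scalar.clone()
    t_vector[0] = 0      # first frame is given  -> image-to-video
    t_vector[-1] = 0     # last frame is given   -> start-end frame generation

    model = ToyFrameDenoiser(channels)
    noise_pred = model(latents, t_vector)   # one denoising step with per-frame control

Pinning only the first frame corresponds to image-to-video, pinning both ends to start-end frame generation; this is the frame-level timestep control the FVDM paper describes.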

Quick Start & Requirements

  • Install: Clone the repository, cd into it, and use uv for installation:
    git clone https://github.com/genmoai/models
    cd models
    pip install uv
    uv venv .venv
    source .venv/bin/activate
    uv pip install setuptools
    uv pip install -e . --no-build-isolation
    
    For Flash Attention: uv pip install -e .[flash] --no-build-isolation
  • Weights: Download with the Hugging Face CLI (huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <dir>) or directly from Hugging Face; a Python alternative is sketched after this list.
  • Prerequisites: Python, uv, and a GPU; some example scripts may require multiple GPUs.
  • Docs: Pusa V1.0 README
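For scripted setups, the same weights can be fetched with the huggingface_hub Python API instead of the CLI. This is a minimal sketch; the destination directory shown is only an example.

    # Programmatic alternative to the huggingface-cli command above.
    # Requires the huggingface_hub package (installed alongside the CLI).
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="RaphaelLiu/Pusa-V0.5",   # model repo named in the CLI example
        local_dir="./Pusa-V0.5",          # example destination; choose any path
    )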

Highlighted Details

  • Achieves an 87.32% VBench-I2V score, surpassing Wan-I2V-14B.
  • Training cost: ≤ $500 vs. ≥ $100,000 for Wan-I2V-14B.
  • Dataset size: ≤ 4K samples vs. ≥ 10M samples.
  • Supports Text-to-Video, Image-to-Video, Start-End Frames, Video Extension, and Video Transition.

Maintenance & Community

V1.0, built on the Wan-Video models, was released in July 2025 with code, a technical report, and the training dataset. The earlier V0.5, built on Mochi, shipped with inference scripts. The project welcomes collaboration.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Video generation quality is dependent on the base model used (e.g., Wan-T2V-14B for V1.0). The project anticipates further quality improvements with more advanced base models and welcomes community contributions.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 11 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

FastVideo by hao-ai-lab

  • Top 1.2% on SourcePulse
  • 3k stars
  • Framework for accelerated video generation
  • Created 1 year ago; updated 2 days ago