Pusa-VidGen by Yaofang-Liu

Video diffusion model with vectorized timestep adaptation

Created 7 months ago
660 stars

Top 50.8% on SourcePulse

Project Summary

Pusa-VidGen introduces a novel vectorized timestep adaptation (VTA) technique for video diffusion models, enabling fine-grained temporal control and multi-task capabilities at a small fraction of the usual training cost. Aimed at researchers and developers in AI video generation, it reports large reductions in training cost and dataset size compared to existing state-of-the-art models.

How It Works

Pusa employs frame-level noise control via vectorized timesteps, a departure from traditional scalar timestep methods. This approach, detailed in the FVDM paper, allows for non-destructive adaptation of base models like Wan-Video and Mochi, preserving their original capabilities while enabling new functionalities such as image-to-video, start-end frame generation, video extension, and transitions without task-specific training.
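As a rough illustration only (the class and function names below are placeholders, not Pusa's actual API), the difference between a scalar timestep and a vectorized one can be sketched in a few lines of PyTorch: giving each frame its own timestep lets conditioning frames be pinned at zero noise while the remaining frames are denoised.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a video diffusion backbone that accepts one
    # timestep per frame instead of a single scalar shared by all frames.
    class ToyFrameDenoiser(nn.Module):
        def __init__(self, channels=4):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, latents, t_per_frame):
            # latents: (frames, C, H, W); t_per_frame: (frames,)
            # Each frame's prediction is conditioned on its own (normalized) timestep.
            scale = (t_per_frame.float() / 1000.0).view(-1, 1, 1, 1)
            return self.conv(latents) * scale

    frames, channels, h, w = 16, 4, 32, 32
    latents = torch.randn(frames, channels, h, w)

    # Traditional scalar timestep: every frame sits at the same noise level.
    t_scalar = torch.full((frames,), 999)

    # Vectorized timesteps: pin conditioning frames at t=0 (clean) while the
    # rest stay noisy, giving frame-level control without task-specific training.
    t_vector = t_scalar.clone()
    t_vector[0] = 0      # first frame is given  -> image-to-video
    t_vector[-1] = 0     # last frame is given   -> start-end frame generation

    model = ToyFrameDenoiser(channels)
    noise_pred = model(latents, t_vector)   # one denoising step with per-frame control

Pinning only the first frame corresponds to image-to-video, pinning both ends to start-end frame generation; this is the frame-level timestep control the FVDM paper describes.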

Quick Start & Requirements

  • Install: Clone the repository, cd into it, and use uv for installation:
    git clone https://github.com/genmoai/models
    cd models
    pip install uv
    uv venv .venv
    source .venv/bin/activate
    uv pip install setuptools
    uv pip install -e . --no-build-isolation
    
    For Flash Attention: uv pip install -e .[flash] --no-build-isolation
  • Weights: Download with the Hugging Face CLI (huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <dir>) or directly from Hugging Face; a Python alternative is sketched after this list.
  • Prerequisites: Python, uv, and a GPU; some example scripts may require multiple GPUs.
  • Docs: Pusa V1.0 README
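For scripted setups, the same weights can be fetched with the huggingface_hub Python API instead of the CLI. This is a minimal sketch; the destination directory shown is only an example.

    # Programmatic alternative to the huggingface-cli command above.
    # Requires the huggingface_hub package (installed alongside the CLI).
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="RaphaelLiu/Pusa-V0.5",   # model repo named in the CLI example
        local_dir="./Pusa-V0.5",          # example destination; choose any path
    )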

Highlighted Details

  • Achieves an 87.32% VBench-I2V score, surpassing Wan-I2V-14B.
  • Training cost: ≤ $500 vs. ≥ $100,000 for Wan-I2V-14B.
  • Dataset size: ≤ 4K samples vs. ≥ 10M samples.
  • Supports Text-to-Video, Image-to-Video, Start-End Frames, Video Extension, and Video Transition.

Maintenance & Community

V1.0, built on the Wan-Video models, was released in July 2025 with code, a technical report, and the training dataset. The earlier V0.5, built on Mochi, shipped with inference scripts. The project welcomes collaboration.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Video generation quality is dependent on the base model used (e.g., Wan-T2V-14B for V1.0). The project anticipates further quality improvements with more advanced base models and welcomes community contributions.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 11 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

FastVideo by hao-ai-lab

  • Top 1.2% on SourcePulse
  • 3k stars
  • Framework for accelerated video generation
  • Created 1 year ago; updated 2 days ago