Step-Video-T2V by stepfun-ai

Text-to-video model for generating high-fidelity, dynamic videos

created 5 months ago
3,083 stars

Top 15.9% on sourcepulse

View on GitHub
Project Summary

Step-Video-T2V is a 30-billion-parameter text-to-video model that generates videos of up to 204 frames. It targets researchers and developers working on AI video generation, offering state-of-the-art quality and efficiency through novel compression and optimization techniques.

How It Works

The model uses a deep-compression Video-VAE that achieves 16x16 spatial and 8x temporal compression, significantly improving training and inference efficiency. Video generation is handled by a Diffusion Transformer (DiT) with 3D full attention, conditioned on diffusion timesteps and on text embeddings from bilingual text encoders. Direct Preference Optimization (DPO) is applied in the final training stage to enhance visual quality and reduce artifacts, leading to smoother, more realistic outputs.
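As a rough illustration of why the compression matters, the sketch below works out the latent grid for a 204-frame, 768x768 clip under 16x16 spatial and 8x temporal compression. The rounding behaviour and the absence of padding or an extra stem frame are assumptions for illustration, not details taken from the model.

```python
# Latent-grid arithmetic for the Video-VAE compression described above.
# Assumption: latent frames ~= ceil(T / 8) and spatial dims are divided by 16;
# the real VAE may pad or keep an extra frame, so treat these as estimates.
import math

def latent_grid(frames: int, height: int, width: int,
                t_stride: int = 8, s_stride: int = 16) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width) after compression."""
    return (math.ceil(frames / t_stride),
            math.ceil(height / s_stride),
            math.ceil(width / s_stride))

t, h, w = latent_grid(204, 768, 768)
print(f"latent grid: {t} x {h} x {w} -> {t * h * w:,} spatio-temporal positions")
# The 16 x 16 x 8 factor shrinks the sequence the DiT must attend over by
# ~2048x compared with operating on raw pixels.
```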

Quick Start & Requirements

  • Install: pip install -e . after cloning the repository. flash-attn is optional.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.3 (cu121 build), CUDA Toolkit, FFmpeg. Requires an NVIDIA GPU with CUDA support. The text encoder's self-attention requires CUDA compute capability sm_80, sm_86, or sm_90 (a minimal environment-check sketch follows this list).
  • Hardware: Peak GPU memory ranges from 72.48 GB to 78.55 GB when generating 204 frames at 768x768 resolution. Recommended: an 80 GB GPU. Tested on four GPUs.
  • Links: Hugging Face, ModelScope, Technical Report, Online Engine
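The pre-flight check below mirrors those prerequisites; it is illustrative only and not part of the repository, and the 80 GB threshold simply restates the hardware recommendation above.

```python
# Illustrative environment check for the prerequisites listed above.
# Not part of the Step-Video-T2V repo; thresholds mirror the bullets.
import shutil
import sys

import torch

def check_environment() -> None:
    assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
    assert torch.cuda.is_available(), "An NVIDIA GPU with CUDA support is required"

    major, minor = torch.cuda.get_device_capability(0)
    # The text-encoder self-attention needs sm_80, sm_86, or sm_90.
    assert (major, minor) in {(8, 0), (8, 6), (9, 0)}, \
        f"Unsupported compute capability sm_{major}{minor}"

    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb < 80:
        print(f"Warning: only {total_gb:.0f} GB of GPU memory; 204-frame 768x768 "
              "generation peaks at ~72-79 GB, so an 80 GB GPU is recommended")

    assert shutil.which("ffmpeg"), "FFmpeg must be on PATH"

if __name__ == "__main__":
    check_environment()
    print("Environment looks OK for Step-Video-T2V")
```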

Highlighted Details

  • Generates videos up to 204 frames with 16x16 spatial and 8x temporal compression.
  • Employs a DiT architecture with 3D full attention and 3D RoPE to handle varying video lengths (a minimal 3D RoPE sketch follows this list).
  • Incorporates Direct Preference Optimization (DPO) for enhanced visual quality and artifact reduction.
  • Evaluated on a novel benchmark, Step-Video-T2V-Eval, featuring 128 Chinese prompts.
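Below is a minimal sketch of factorized 3D RoPE of the kind referenced above, not the repository's implementation: the split of the head dimension across the time/height/width axes and the base frequency are assumptions chosen for illustration (and the split assumes a head dimension divisible by 8).

```python
# Factorized 3D RoPE sketch: the head dimension is split across the
# (time, height, width) axes and each chunk is rotated by its own position.
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Complex rotations of shape (len(pos), dim // 2) for one axis."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(pos.float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)  # e^{i * angle}

def rope_3d(t: int, h: int, w: int, head_dim: int) -> torch.Tensor:
    """Rotations for a t x h x w token grid, flattened to (t*h*w, head_dim // 2)."""
    dt, dh, dw = head_dim // 2, head_dim // 4, head_dim // 4  # assumed split
    rt = rope_1d(torch.arange(t), dt)
    rh = rope_1d(torch.arange(h), dh)
    rw = rope_1d(torch.arange(w), dw)
    grid = torch.cat([
        rt[:, None, None, :].expand(t, h, w, -1),
        rh[None, :, None, :].expand(t, h, w, -1),
        rw[None, None, :, :].expand(t, h, w, -1),
    ], dim=-1)
    return grid.reshape(t * h * w, head_dim // 2)

def apply_rope(x: torch.Tensor, rot: torch.Tensor) -> torch.Tensor:
    """Rotate queries/keys of shape (..., seq, head_dim) by the precomputed grid."""
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_c * rot).flatten(-2).type_as(x)

# Because positions are generated per axis, the same rotations cover clips of
# different lengths (up to the trained maximum, e.g. 204 frames) without
# re-learning positional embeddings.
```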

Maintenance & Community

  • Code will be integrated into the official Hugging Face Diffusers repository.
  • Collaboration with the FastVideo team for inference acceleration solutions.

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The text encoder's self-attention has specific CUDA compute capability requirements (sm_80, sm_86, or sm_90). Multi-GPU inference requires a decoupling strategy, with dedicated GPUs serving the text encoder and VAE decoding. Single-GPU inference and quantization are available via the DiffSynth-Studio project.
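To make the decoupling idea concrete, here is an illustrative device-placement sketch. The module names are hypothetical; the repository actually exposes the text encoder and VAE as separate services rather than in-process modules.

```python
# Illustrative sketch of the decoupling strategy described above: pin the text
# encoder and VAE decoder to dedicated GPUs so the remaining devices are free
# for the DiT denoising loop. Names are hypothetical, not the repo's API.
import torch

def place_components(text_encoder: torch.nn.Module,
                     vae: torch.nn.Module,
                     dit: torch.nn.Module) -> None:
    text_encoder.to("cuda:0")   # dedicated GPU for prompt encoding
    vae.to("cuda:1")            # dedicated GPU for latent decoding
    dit.to("cuda:2")            # remaining GPU(s) run the 30B DiT
    # In practice the 30B DiT is sharded across several 80 GB GPUs, and the
    # encoder/decoder run as separate services, so data crosses devices only
    # at the prompt-embedding and latent-decoding steps.
```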

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 189 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

CogVideo by zai-org

  • Text-to-video generation models (CogVideoX, CogVideo)
  • Top 0.4% on sourcepulse, 12k stars
  • created 3 years ago, updated 1 month ago