Text-to-video model for generating high-fidelity, dynamic videos
Top 15.9% on sourcepulse
Step-Video-T2V is a 30-billion-parameter text-to-video model capable of generating videos up to 204 frames long. It targets researchers and developers in AI video generation, offering state-of-the-art quality and efficiency through novel compression and optimization techniques.
How It Works
The model utilizes a deep compression Video-VAE for 16x16 spatial and 8x temporal compression, significantly improving training and inference efficiency. Video generation is handled by a Diffusion Transformer (DiT) with 3D full attention, conditioned on text embeddings from bilingual encoders and timesteps. Direct Preference Optimization (DPO) is applied in the final stage to enhance visual quality and reduce artifacts, leading to smoother, more realistic outputs.
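To make the compression arithmetic concrete, here is a minimal sketch in plain Python. The function and variable names are illustrative (not the repository's actual API), and the 544x992 input resolution is only an example; the point is how the 16x16 spatial and 8x temporal factors shrink a 204-frame video into the latent grid the DiT attends over.

```python
# Minimal sketch of the Video-VAE compression arithmetic described above.
# Names are illustrative, not the repository's API; 544x992 is an example size.

def latent_shape(frames: int, height: int, width: int,
                 temporal_factor: int = 8, spatial_factor: int = 16):
    """Latent grid the DiT operates on after Video-VAE compression."""
    return (frames // temporal_factor,   # 8x temporal compression
            height // spatial_factor,    # 16x spatial compression (height)
            width // spatial_factor)     # 16x spatial compression (width)

# A 204-frame clip at 544x992 collapses to a 25 x 34 x 62 latent grid,
# which is what keeps 3D full attention in the DiT tractable.
print(latent_shape(204, 544, 992))  # -> (25, 34, 62)
```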
Quick Start & Requirements
pip install -e .
after cloning the repository. flash-attn
is optional.Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The text encoder's self-attention requires specific CUDA compute capabilities (sm_80, sm_86, sm_90). Multi-GPU inference requires a decoupling strategy, with dedicated GPUs serving the text encoder and VAE decoding. Single-GPU inference and quantization are available via the DiffSynth-Studio project.
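As a quick pre-flight check, a snippet like the one below can verify that the GPU meets the compute-capability requirement. The torch call is standard PyTorch; the warning message itself is illustrative.

```python
# Hedged sketch: warn if the current GPU lacks one of the CUDA compute
# capabilities (sm_80, sm_86, sm_90) needed by the text encoder's attention.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) not in [(8, 0), (8, 6), (9, 0)]:
        print(f"Warning: sm_{major}{minor} GPU detected; the text encoder's "
              "self-attention kernels may not run on this device.")
else:
    print("No CUDA device found.")
```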