Step-Video-T2V by stepfun-ai

Text-to-video model for generating high-fidelity, dynamic videos

created 5 months ago
3,083 stars

Top 15.9% on sourcepulse

View on GitHub
Project Summary

Step-Video-T2V is a 30-billion-parameter text-to-video model that generates videos of up to 204 frames. It targets researchers and developers working on AI video generation, offering state-of-the-art quality and efficiency through novel compression and optimization techniques.

How It Works

The model uses a deep-compression Video-VAE that achieves 16x16 spatial and 8x temporal compression, significantly improving training and inference efficiency. Video generation is handled by a Diffusion Transformer (DiT) with 3D full attention, conditioned on diffusion timesteps and on text embeddings from bilingual text encoders. Direct Preference Optimization (DPO) is applied in the final training stage to enhance visual quality and reduce artifacts, leading to smoother, more realistic outputs.
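As a rough illustration of why the compression matters, the sketch below works out the latent grid for a 204-frame, 768x768 clip under 16x16 spatial and 8x temporal compression. The rounding behaviour and the absence of padding or an extra stem frame are assumptions for illustration, not details taken from the model.

```python
# Latent-grid arithmetic for the Video-VAE compression described above.
# Assumption: latent frames ~= ceil(T / 8) and spatial dims are divided by 16;
# the real VAE may pad or keep an extra frame, so treat these as estimates.
import math

def latent_grid(frames: int, height: int, width: int,
                t_stride: int = 8, s_stride: int = 16) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width) after compression."""
    return (math.ceil(frames / t_stride),
            math.ceil(height / s_stride),
            math.ceil(width / s_stride))

t, h, w = latent_grid(204, 768, 768)
print(f"latent grid: {t} x {h} x {w} -> {t * h * w:,} spatio-temporal positions")
# The 16 x 16 x 8 factor shrinks the sequence the DiT must attend over by
# ~2048x compared with operating on raw pixels.
```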

Quick Start & Requirements

  • Install: pip install -e . after cloning the repository. flash-attn is optional.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.3 (cu121 build), CUDA Toolkit, FFmpeg. Requires an NVIDIA GPU with CUDA support. The text encoder's self-attention requires CUDA compute capability sm_80, sm_86, or sm_90 (a minimal environment-check sketch follows this list).
  • Hardware: Peak GPU memory ranges from 72.48 GB to 78.55 GB when generating 204 frames at 768x768 resolution. Recommended: an 80 GB GPU. Tested on four GPUs.
  • Links: Hugging Face, ModelScope, Technical Report, Online Engine
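The pre-flight check below mirrors those prerequisites; it is illustrative only and not part of the repository, and the 80 GB threshold simply restates the hardware recommendation above.

```python
# Illustrative environment check for the prerequisites listed above.
# Not part of the Step-Video-T2V repo; thresholds mirror the bullets.
import shutil
import sys

import torch

def check_environment() -> None:
    assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
    assert torch.cuda.is_available(), "An NVIDIA GPU with CUDA support is required"

    major, minor = torch.cuda.get_device_capability(0)
    # The text-encoder self-attention needs sm_80, sm_86, or sm_90.
    assert (major, minor) in {(8, 0), (8, 6), (9, 0)}, \
        f"Unsupported compute capability sm_{major}{minor}"

    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb < 80:
        print(f"Warning: only {total_gb:.0f} GB of GPU memory; 204-frame 768x768 "
              "generation peaks at ~72-79 GB, so an 80 GB GPU is recommended")

    assert shutil.which("ffmpeg"), "FFmpeg must be on PATH"

if __name__ == "__main__":
    check_environment()
    print("Environment looks OK for Step-Video-T2V")
```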

Highlighted Details

  • Generates videos up to 204 frames with 16x16 spatial and 8x temporal compression.
  • Employs a DiT architecture with 3D full attention and 3D RoPE to handle varying video lengths (a minimal 3D RoPE sketch follows this list).
  • Incorporates Direct Preference Optimization (DPO) for enhanced visual quality and artifact reduction.
  • Evaluated on a novel benchmark, Step-Video-T2V-Eval, featuring 128 Chinese prompts.
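Below is a minimal sketch of factorized 3D RoPE of the kind referenced above, not the repository's implementation: the split of the head dimension across the time/height/width axes and the base frequency are assumptions chosen for illustration (and the split assumes a head dimension divisible by 8).

```python
# Factorized 3D RoPE sketch: the head dimension is split across the
# (time, height, width) axes and each chunk is rotated by its own position.
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Complex rotations of shape (len(pos), dim // 2) for one axis."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(pos.float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)  # e^{i * angle}

def rope_3d(t: int, h: int, w: int, head_dim: int) -> torch.Tensor:
    """Rotations for a t x h x w token grid, flattened to (t*h*w, head_dim // 2)."""
    dt, dh, dw = head_dim // 2, head_dim // 4, head_dim // 4  # assumed split
    rt = rope_1d(torch.arange(t), dt)
    rh = rope_1d(torch.arange(h), dh)
    rw = rope_1d(torch.arange(w), dw)
    grid = torch.cat([
        rt[:, None, None, :].expand(t, h, w, -1),
        rh[None, :, None, :].expand(t, h, w, -1),
        rw[None, None, :, :].expand(t, h, w, -1),
    ], dim=-1)
    return grid.reshape(t * h * w, head_dim // 2)

def apply_rope(x: torch.Tensor, rot: torch.Tensor) -> torch.Tensor:
    """Rotate queries/keys of shape (..., seq, head_dim) by the precomputed grid."""
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_c * rot).flatten(-2).type_as(x)

# Because positions are generated per axis, the same rotations cover clips of
# different lengths (up to the trained maximum, e.g. 204 frames) without
# re-learning positional embeddings.
```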

Maintenance & Community

  • Code will be integrated into the official Hugging Face Diffusers repository.
  • Collaboration with the FastVideo team for inference acceleration solutions.

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The text encoder's self-attention has specific CUDA compute capability requirements (sm_80, sm_86, or sm_90). Multi-GPU inference requires a decoupling strategy, with dedicated GPUs serving the text encoder and VAE decoding. Single-GPU inference and quantization are available via the DiffSynth-Studio project.
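To make the decoupling idea concrete, here is an illustrative device-placement sketch. The module names are hypothetical; the repository actually exposes the text encoder and VAE as separate services rather than in-process modules.

```python
# Illustrative sketch of the decoupling strategy described above: pin the text
# encoder and VAE decoder to dedicated GPUs so the remaining devices are free
# for the DiT denoising loop. Names are hypothetical, not the repo's API.
import torch

def place_components(text_encoder: torch.nn.Module,
                     vae: torch.nn.Module,
                     dit: torch.nn.Module) -> None:
    text_encoder.to("cuda:0")   # dedicated GPU for prompt encoding
    vae.to("cuda:1")            # dedicated GPU for latent decoding
    dit.to("cuda:2")            # remaining GPU(s) run the 30B DiT
    # In practice the 30B DiT is sharded across several 80 GB GPUs, and the
    # encoder/decoder run as separate services, so data crosses devices only
    # at the prompt-embedding and latent-decoding steps.
```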

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 189 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

CogVideo by zai-org

  • Text-to-video generation models (CogVideoX, CogVideo)
  • Top 0.4% on sourcepulse, 12k stars
  • created 3 years ago, updated 1 month ago