Step-Video-T2V by stepfun-ai

Text-to-video model for generating high-fidelity, dynamic videos

Created 7 months ago
3,111 stars

Top 15.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Step-Video-T2V is a 30-billion parameter text-to-video model capable of generating videos up to 204 frames. It targets researchers and developers in AI video generation, offering state-of-the-art quality and efficiency through novel compression and optimization techniques.

How It Works

The model utilizes a deep compression Video-VAE for 16x16 spatial and 8x temporal compression, significantly improving training and inference efficiency. Video generation is handled by a Diffusion Transformer (DiT) with 3D full attention, conditioned on text embeddings from bilingual encoders and timesteps. Direct Preference Optimization (DPO) is applied in the final stage to enhance visual quality and reduce artifacts, leading to smoother, more realistic outputs.
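
As a rough illustration of what these compression ratios imply, the sketch below computes the latent grid for a 204-frame, 768x768 clip under 8x temporal and 16x16 spatial compression. It is not code from the repository: the latent channel count is a placeholder, and ceil-rounding of partial chunks is an assumption about the VAE's padding.

```python
import math

def latent_shape(frames, height, width,
                 t_stride=8, s_stride=16, latent_channels=16):
    """Latent-grid size implied by 8x temporal and 16x16 spatial compression.

    `latent_channels` is a placeholder, not the model's actual channel count,
    and rounding partial chunks up is an assumption about VAE padding.
    """
    t = math.ceil(frames / t_stride)
    h = math.ceil(height / s_stride)
    w = math.ceil(width / s_stride)
    return (latent_channels, t, h, w)

shape = latent_shape(204, 768, 768)
positions = shape[1] * shape[2] * shape[3]
print(shape, positions)  # (16, 26, 48, 48) -> 59,904 spatio-temporal positions
```

The combined 8x16x16 factor (~2,048x fewer spatio-temporal positions than raw pixels) is what drives the training and inference efficiency gains described above.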

Quick Start & Requirements

  • Install: pip install -e . after cloning the repository. flash-attn is optional.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.3-cu121, CUDA Toolkit, FFmpeg. Requires an NVIDIA GPU with CUDA support. Text encoder self-attention requires CUDA capability sm_80, sm_86, or sm_90 (a quick pre-flight check is sketched after this list).
  • Hardware: Peak GPU memory ranges from 72.48 GB to 78.55 GB for generating 204 frames at 768x768 resolution. Recommended: 80GB GPU. Tested on four GPUs.
  • Links: Hugging Face, ModelScope, Technical Report, Online Engine
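
A minimal pre-flight check against the prerequisites above might look like the following. It only verifies the documented requirements (Python >= 3.10, PyTorch >= 2.3, an sm_80/sm_86/sm_90 GPU, roughly 80 GB of memory) and is not part of the repository.

```python
import sys
import torch

# Documented prerequisites: Python >= 3.10, PyTorch >= 2.3, CUDA-capable NVIDIA GPU.
assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
assert torch.__version__ >= "2.3", "PyTorch >= 2.3 is required"  # rough string check
assert torch.cuda.is_available(), "an NVIDIA GPU with CUDA support is required"

# Text encoder self-attention needs compute capability sm_80, sm_86, or sm_90.
capability = torch.cuda.get_device_capability(0)
assert capability in {(8, 0), (8, 6), (9, 0)}, f"unsupported compute capability {capability}"

# Peak usage of ~72-79 GB is reported for 204 frames at 768x768; 80 GB is recommended.
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
if total_gb < 80:
    print(f"warning: only {total_gb:.0f} GB of GPU memory detected")
```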

Highlighted Details

  • Generates videos up to 204 frames with 16x16 spatial and 8x temporal compression.
  • Employs a DiT architecture with 3D full attention and 3D RoPE for handling varying video lengths (a generic RoPE sketch follows this list).
  • Incorporates Direct Preference Optimization (DPO) for enhanced visual quality and artifact reduction.
  • Evaluated on a novel benchmark, Step-Video-T2V-Eval, featuring 128 Chinese prompts.
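
The 3D RoPE mentioned above can be illustrated with a generic sketch: the per-head dimension is split across the time, height, and width axes, and a standard 1D rotary embedding is applied to each chunk using that axis's position. The split proportions and function names below are illustrative assumptions, not the model's actual implementation.

```python
import torch

def rope_1d(x, pos, theta=10000.0):
    """Standard 1D rotary embedding over the last dim of x (even-sized)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos.to(torch.float32)[:, None] * inv_freq[None, :]   # (tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t_idx, h_idx, w_idx, split=(32, 48, 48)):
    """Split the head dim across (time, height, width) and rotate each chunk.

    `split` is a hypothetical partition; because frequencies depend only on
    per-axis positions, the same code handles any number of frames.
    """
    dt, dh, dw = split
    assert x.shape[-1] == dt + dh + dw
    return torch.cat([
        rope_1d(x[..., :dt], t_idx),
        rope_1d(x[..., dt:dt + dh], h_idx),
        rope_1d(x[..., dt + dh:], w_idx),
    ], dim=-1)

# Example: a 26x48x48 latent grid flattened into tokens with a 128-dim head.
t, h, w = 26, 48, 48
grid = torch.stack(torch.meshgrid(
    torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"), dim=-1).reshape(-1, 3)
q = torch.randn(t * h * w, 128)
q_rotated = rope_3d(q, grid[:, 0], grid[:, 1], grid[:, 2])
```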

Maintenance & Community

  • Code will be integrated into the official Hugging Face Diffusers repository.
  • Collaboration with the FastVideo team for inference acceleration solutions.

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The text encoder's self-attention has specific CUDA capability requirements (sm_80, sm_86, or sm_90). Multi-GPU inference requires a decoupling strategy, with dedicated GPUs serving the text encoder and VAE decoding. Single-GPU inference and quantization are available via the DiffSynth-Studio project.
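
A heavily simplified sketch of that decoupled placement is shown below. The modules are stand-ins (plain nn.Linear layers), the device assignments are assumptions, and per the caveat above the actual setup exposes the text encoder and VAE as separate services rather than running everything in one script.

```python
import torch
import torch.nn as nn

# Stand-in modules; they only illustrate the device split, not the real classes.
text_encoder = nn.Linear(1024, 1024)
dit          = nn.Linear(1024, 1024)
vae_decoder  = nn.Linear(1024, 1024)

if torch.cuda.device_count() >= 3:
    text_encoder.to("cuda:0")   # dedicated GPU serving text embeddings
    vae_decoder.to("cuda:1")    # dedicated GPU serving latent-to-pixel decoding
    dit.to("cuda:2")            # remaining GPU(s) run the diffusion transformer

    prompt_emb = text_encoder(torch.randn(1, 1024, device="cuda:0"))
    latents = dit(prompt_emb.to("cuda:2"))        # denoising loop would run here
    video = vae_decoder(latents.to("cuda:1"))     # decode latents back to frames
```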

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 3
  • Star history: 21 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

SkyReels-V2 by SkyworkAI

Film generation model for infinite-length videos using diffusion forcing
4k stars
Top 3.3%
Created 5 months ago
Updated 1 month ago