Video generation research paper using VQ-VAE and Transformers
VideoGPT offers a straightforward architecture for generative video modeling using VQ-VAE and Transformers. It's designed for researchers and practitioners looking for a reproducible, minimalistic approach to video generation that is competitive with GANs on benchmark datasets such as BAIR Robot Pushing.
How It Works
VideoGPT employs a two-stage process. First, a VQ-VAE with 3D convolutions and axial self-attention discretizes raw video into a sequence of latent codes. Second, a GPT-like Transformer autoregressively models these discrete latents, incorporating spatio-temporal position encodings. This approach simplifies training and allows for competitive generation quality with a clean, modular design.
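A minimal PyTorch sketch of the two-stage idea is shown below. All class names, shapes, and hyperparameters here are invented for illustration and are not the repository's API; the real model uses 3D convolutions, axial self-attention, straight-through gradients, and spatio-temporal position encodings, all omitted for brevity.

import torch
import torch.nn as nn

class ToyVectorQuantizer(nn.Module):
    # Stage 1 (sketch): map continuous encoder features to codebook indices.
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, num_latents, dim), e.g. a flattened T*H*W grid of features.
        # Nearest-codebook-entry lookup; straight-through trick and losses omitted.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(-1)  # (batch, num_latents) discrete latent codes

class ToyLatentPrior(nn.Module):
    # Stage 2 (sketch): GPT-style causal Transformer over the discrete codes.
    def __init__(self, num_codes=512, dim=64, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.pos = nn.Embedding(max_len, dim)  # stand-in for spatio-temporal encodings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, idx):
        # idx: (batch, num_latents) integer codes produced by stage 1.
        n = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(n, device=idx.device))
        x = x.transpose(0, 1)  # (num_latents, batch, dim): sequence-first layout
        mask = torch.full((n, n), float("-inf"), device=idx.device).triu(1)  # causal mask
        x = self.blocks(x, mask=mask)
        return self.head(x.transpose(0, 1))  # (batch, num_latents, num_codes) logits

feats = torch.randn(2, 64, 64)           # stand-in for 3D-conv encoder output
codes = ToyVectorQuantizer()(feats)      # stage 1: discretize the video
logits = ToyLatentPrior()(codes)         # stage 2: next-code prediction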
Quick Start & Requirements
Install the pinned PyTorch build first:
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
then install the package:
pip install git+https://github.com/wilson1yan/VideoGPT.git
Requirements: CUDA 11.0 (cudatoolkit=11.0), Python 3.7+, PyTorch 1.7.1. Optional sparse attention additionally requires llvm-9-dev and deepspeed.
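A quick post-install sanity check (not taken from the README) confirms the pinned build and GPU visibility:
import torch
print(torch.__version__)          # expect 1.7.1+cu110 with the pin above
print(torch.cuda.is_available())  # True on a machine with a working CUDA 11.0 setup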
Highlighted Details
Maintenance & Community
The project is associated with authors from UC Berkeley and Google. Further details on community or roadmap are not explicitly stated in the README.
Licensing & Compatibility
The repository does not explicitly state a license. The code is provided for research purposes.
Limitations & Caveats
The README notes that reproducing full paper results requires a separate, less clean codebase. The provided PyTorch version (1.7.1) is older, potentially requiring environment management for compatibility with newer libraries.
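One way to keep the pinned stack isolated is a dedicated environment; a sketch, not taken from the README (the environment name is arbitrary):
conda create -n videogpt python=3.7
conda activate videogpt
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/wilson1yan/VideoGPT.git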