Text-to-video generation research paper implementation
Top 89.3% on sourcepulse
This repository provides official implementations for T2V-Turbo and T2V-Turbo-v2, advanced text-to-video generation models. It targets researchers and developers looking to achieve fast, high-quality video synthesis, addressing the quality bottleneck in video consistency models through novel reward feedback and conditional guidance techniques.
How It Works
The models build on diffusion-based video generation, distilling it into a consistency model trained with mixed reward feedback and enhanced conditional guidance. T2V-Turbo-v2 focuses on the post-training stage, combining data augmentation, reward mechanisms, and refined conditional guidance to reach higher video quality and consistency with fewer inference steps.
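As a rough illustration of the mixed reward feedback idea, the sketch below combines a per-frame image-text reward (the dependency list includes hpsv2 and image-reward) with a clip-level video-text reward into a single training signal. The callables `image_rm` and `video_rm` and the weighting scheme are assumptions for illustration, not the repository's actual interfaces.

```python
import torch

def mixed_reward_loss(frames: torch.Tensor, prompt: str, image_rm, video_rm,
                      w_img: float = 1.0, w_vid: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch: mix image-level and video-level reward scores.

    frames:   (num_frames, C, H, W) decoded video frames
    image_rm: callable(frames, prompt) -> per-frame scores, shape (num_frames,)
    video_rm: callable(frames, prompt) -> clip-level score, scalar tensor
    """
    frame_scores = image_rm(frames, prompt)   # text-image alignment per frame
    clip_score = video_rm(frames, prompt)     # text-video / temporal alignment
    reward = w_img * frame_scores.mean() + w_vid * clip_score
    return -reward  # minimizing the negative reward maximizes the mixed reward
```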
Quick Start & Requirements
Install the Python dependencies:
pip install accelerate transformers diffusers webdataset loralib peft pytorch_lightning open_clip_torch==2.24.0 hpsv2 image-reward wandb av einops packaging omegaconf opencv-python kornia moviepy imageio torchdata==0.8.0 decord torchaudio bitsandbytes langdetect scipy git+https://github.com/openai/CLIP.git
After cloning https://github.com/Dao-AILab/flash-attention.git, install FlashAttention: pip install flash-attn --no-build-isolation
Install xformers: conda install xformers -c xformers
After downloading the checkpoints and installing Gradio, launch the demo: python app.py
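Beyond the Gradio demo, the practical payoff of consistency distillation is that sampling collapses to a handful of denoiser calls. The loop below is a generic sketch of few-step consistency sampling in latent space, not the repository's inference code; the `denoiser` callable, latent shape, and noise schedule are illustrative assumptions.

```python
import torch

@torch.no_grad()
def few_step_sample(denoiser, text_emb, num_steps=4,
                    shape=(1, 4, 16, 40, 64), device="cuda"):
    """Hypothetical sketch of few-step consistency sampling.

    denoiser(x_t, sigma, text_emb) -> predicted clean latent x_0
    shape: (batch, channels, frames, height, width) of the video latent.
    """
    # Descending noise levels; a distilled model needs only a handful of them.
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    x = torch.randn(shape, device=device) * sigmas[0]
    for i in range(num_steps):
        x0_pred = denoiser(x, sigmas[i], text_emb)   # one-call clean estimate
        if i < num_steps - 1:
            # Re-noise the clean estimate down to the next noise level.
            x = x0_pred + sigmas[i + 1] * torch.randn_like(x0_pred)
        else:
            x = x0_pred
    return x  # decode with the VAE afterwards to get pixel-space frames
```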
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify a license, which may impact commercial adoption. macOS users require a specific device configuration (mps) for the demos, and Intel GPU users need xpu. Training requires specific data preparation (WebVid-10M in webdataset format) and additional model downloads.
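To make the device caveat concrete, here is a minimal sketch of how the demo's device could be chosen across NVIDIA, Apple Silicon, and Intel GPU machines. The helper name and selection logic are assumptions; the repository's actual configuration flag is not specified here.

```python
import torch

def pick_device() -> str:
    """Hypothetical helper: choose the accelerator the demos should run on."""
    if torch.cuda.is_available():
        return "cuda"                                  # NVIDIA GPUs (default path)
    if torch.backends.mps.is_available():
        return "mps"                                   # Apple Silicon / macOS Metal backend
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"                                   # Intel GPUs via the XPU backend
    return "cpu"

device = torch.device(pick_device())
```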