Codebase for text-to-video applications
VideoTuna is a comprehensive toolkit for text-to-video (T2V) and image-to-video (I2V) generation, offering unified pipelines for inference, fine-tuning, continuous training, and human preference alignment. It targets researchers and developers working with state-of-the-art video generation models, providing a flexible framework to adapt and improve existing models or train new ones.
How It Works
VideoTuna integrates multiple leading video generation models, covering T2V, I2V, T2I, and V2V tasks, within a single codebase. It supports advanced training techniques such as LoRA and full fine-tuning, as well as human preference alignment via RLHF. The framework also includes post-processing enhancements and a novel VideoVAE+ model for improved video reconstruction.
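To illustrate the fine-tuning side, below is a minimal, generic sketch of how LoRA adapts a frozen layer with a low-rank trainable update. `LoRALinear`, its parameters, and the wiring are illustrative assumptions, not VideoTuna's actual modules or trainer API.

```python
# A minimal sketch of LoRA-style adaptation, assuming a PyTorch backbone.
# `LoRALinear` and its wiring are illustrative only, not VideoTuna's modules.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen projection + scaled low-rank correction.
        return self.base(x) + self.scale * self.up(self.down(x))

# Usage: swap attention projections (e.g. query/value) of a video diffusion
# backbone for LoRALinear wrappers, then optimize only `down`/`up`.
layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 1024))
```

Because only the low-rank `down`/`up` matrices are trained, the adapter adds a small fraction of the base parameter count, which is what makes fine-tuning large video backbones tractable.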
Quick Start & Requirements
Set up the environment with conda and Poetry:

```bash
conda create -n videotuna python=3.10 -y && conda activate videotuna && pip install poetry && poetry install
```

Install `flash-attn` for Hunyuan model optimization. macOS users with Apple Silicon should use Docker Compose due to dependency compatibility issues.

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Installation of `flash-attn` or `swissarmytransformer` may require retries.
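As a workaround, a small retry wrapper around pip can help. This is a generic sketch, not a VideoTuna utility; the function name, attempt count, and wait time are arbitrary assumptions.

```python
# Hypothetical workaround, not a VideoTuna utility: retry a flaky source
# build such as flash-attn a few times before giving up.
import subprocess
import sys
import time

def pip_install_with_retries(package: str, attempts: int = 3, wait_s: float = 15.0) -> None:
    """Invoke `pip install` up to `attempts` times, pausing between failures."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run([sys.executable, "-m", "pip", "install", package])
        if result.returncode == 0:
            return
        print(f"{package}: attempt {attempt}/{attempts} failed, retrying in {wait_s}s")
        time.sleep(wait_s)
    raise RuntimeError(f"{package} failed to install after {attempts} attempts")

if __name__ == "__main__":
    for pkg in ("flash-attn", "swissarmytransformer"):
        pip_install_with_retries(pkg)
```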