Text-to-video model for generating short, high-quality videos
Allegro is a text-to-video (T2V) model capable of generating up to 6-second, 15 FPS, 720p videos from textual prompts. A companion variant, Allegro-TI2V, additionally accepts a first-frame image (and, optionally, a last-frame image) for more controlled text-image-to-video (TI2V) generation. The project suits researchers and developers looking to integrate or fine-tune advanced video-generation capabilities.
How It Works
Allegro utilizes a Diffusion Transformer (DiT) architecture, building upon the Open-Sora-Plan framework. It employs a VAE for latent space manipulation and a T5 text encoder for prompt understanding. The model generates videos by progressively denoising in the latent space, achieving high-quality outputs through its large parameter count (2.8B for DiT) and optimized inference pipeline.
Quick Start & Requirements
Requires the development version of the diffusers library:

pip install git+https://github.com/huggingface/diffusers.git

Highlighted Details
Inference is supported through the diffusers library.

Maintenance & Community
The project is actively maintained with recent releases in late 2024 and early 2025. A Discord server is available for community support and discussion.
Licensing & Compatibility
Limitations & Caveats
The model cannot render celebrities, legible text, specific locations, streets, or buildings. The current output duration is limited to 6 seconds, though a "Presto" variant is mentioned for longer durations.