Text-to-video model for generating short, high-quality videos
Allegro is a text-to-video (T2V) model capable of generating up to 6-second, 15 FPS, 720p videos from textual prompts. A companion variant, Allegro-TI2V, additionally accepts a first-frame image (and, optionally, a last-frame image) for more controlled text-image-to-video (TI2V) generation. The project suits researchers and developers looking to integrate or fine-tune advanced video-generation capabilities.
How It Works
Allegro utilizes a Diffusion Transformer (DiT) architecture, building upon the Open-Sora-Plan framework. It employs a VAE for latent space manipulation and a T5 text encoder for prompt understanding. The model generates videos by progressively denoising in the latent space, achieving high-quality outputs through its large parameter count (2.8B for DiT) and optimized inference pipeline.
Quick Start & Requirements
Requires the development version of the diffusers library:

pip install git+https://github.com/huggingface/diffusers.git

Highlighted Details
Inference is supported through the diffusers library.

Maintenance & Community
The project is actively maintained with recent releases in late 2024 and early 2025. A Discord server is available for community support and discussion.
Licensing & Compatibility
Limitations & Caveats
The model cannot render celebrities, legible text, specific locations, streets, or buildings. The current output duration is limited to 6 seconds, though a "Presto" variant is mentioned for longer durations.