Allegro by rhymes-ai

Text-to-video model for generating short, high-quality videos

created 9 months ago
1,089 stars

Top 35.5% on sourcepulse

Project Summary

Allegro is a text-to-video (T2V) model that generates videos of up to 6 seconds at 15 FPS and 720p from text prompts. A text-image-to-video (TI2V) variant, Allegro-TI2V, additionally accepts a first-frame image and an optional last-frame image for more controlled generation. The project is aimed at researchers and developers who want to integrate or fine-tune video generation models.

How It Works

Allegro uses a Diffusion Transformer (DiT) architecture built on the Open-Sora-Plan framework. A VAE maps videos to and from a compressed latent space, and a T5 text encoder encodes the prompt. The model generates a video by progressively denoising in the latent space. Output quality is attributed to the size of the DiT (2.8B parameters) and an optimized inference pipeline.
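The denoising flow described above can be illustrated with a toy sketch. This is not Allegro's code: `encode_text`, `predict_noise`, and `denoise` are hypothetical stand-ins, the "latent" is a short list of floats, and the stand-in denoiser simply treats the text embedding as the clean latent. It only shows the shape of the loop (encode the prompt, start from noise, remove predicted noise step by step), assuming nothing about the real scheduler or network.

```python
import random


def encode_text(prompt: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for the T5 text encoder: prompt -> fixed-size vector."""
    rng = random.Random(prompt)  # string seed, so the vector is deterministic per prompt
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def predict_noise(latent: list[float], text_embedding: list[float]) -> list[float]:
    """Hypothetical stand-in for the DiT.

    A real model is a learned network. This toy pretends the text embedding
    is the clean latent and returns the difference as the "noise".
    """
    return [z - c for z, c in zip(latent, text_embedding)]


def denoise(prompt: str, steps: int = 20, seed: int = 0) -> list[float]:
    """Start from Gaussian noise and remove predicted noise over `steps` steps."""
    cond = encode_text(prompt)
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in cond]  # pure noise
    for step in range(steps):
        noise = predict_noise(latent, cond)
        # Remove a growing fraction of the predicted noise; the last step removes all of it.
        fraction = 1.0 / (steps - step)
        latent = [z - fraction * n for z, n in zip(latent, noise)]
    return latent  # a real pipeline would now decode this latent with the VAE
```

In this toy the final latent equals the text embedding up to floating-point rounding, which is only a property of the stand-in denoiser, not a claim about the real model.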

Quick Start & Requirements

  • Install: pip install git+https://github.com/huggingface/diffusers.git (dev version)
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4.
  • Inference: Requires downloading model weights for VAE, DiT, text encoder, and tokenizer.
  • Resources: Single-GPU memory usage is ~9.3 GB (BF16 with CPU offload) or 27.5 GB (without offload). Inference takes ~20 minutes on a single H100 or ~3 minutes on 8xH100.
  • Links: Gallery, Hugging Face, Blog, Paper, Discord
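As a rough worked example of the resource figures above, the sketch below computes the multi-GPU speedup they imply and picks an inference mode from available GPU memory. The function names are hypothetical, the figures are the approximate ones quoted in the list, and treating the reported memory usage as a hard minimum is a simplifying assumption that ignores any extra headroom a real run may need.

```python
# Approximate figures quoted in the Resources bullet.
SINGLE_GPU_MINUTES = 20.0      # single H100
EIGHT_GPU_MINUTES = 3.0        # 8xH100
MEM_WITH_OFFLOAD_GB = 9.3      # BF16 with CPU offload
MEM_WITHOUT_OFFLOAD_GB = 27.5  # no offload


def parallel_speedup(t_single: float, t_multi: float) -> float:
    """How many times faster the multi-GPU run is."""
    return t_single / t_multi


def parallel_efficiency(t_single: float, t_multi: float, n_gpus: int) -> float:
    """Speedup divided by GPU count (1.0 would be perfect scaling)."""
    return parallel_speedup(t_single, t_multi) / n_gpus


def choose_mode(gpu_memory_gb: float) -> str:
    """Pick an inference mode, assuming the quoted usage is the minimum required."""
    if gpu_memory_gb >= MEM_WITHOUT_OFFLOAD_GB:
        return "no_offload"
    if gpu_memory_gb >= MEM_WITH_OFFLOAD_GB:
        return "cpu_offload"
    return "insufficient"
```

With the quoted timings, `parallel_speedup(20.0, 3.0)` is about 6.67 and `parallel_efficiency(20.0, 3.0, 8)` is about 0.83, so eight GPUs give roughly a 6.7x speedup rather than 8x. Under the stated assumption, `choose_mode(24)` returns `"cpu_offload"`.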

Highlighted Details

  • Generates 720p videos at 15 FPS for up to 6 seconds.
  • The Allegro-TI2V variant supports text-image-to-video generation from a first frame and an optional last frame.
  • Full code for Presto, a long-duration T2V model based on Allegro, is released.
  • Training and fine-tuning code is available for both T2V and TI2V models.
  • Integrated into Hugging Face diffusers library.
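For the diffusers integration, a minimal inference sketch might look like the following. It is modeled on the Allegro example in the diffusers documentation rather than on anything stated in this summary: the model ID `rhymes-ai/Allegro`, the generation settings, and the `generate_video` wrapper are assumptions, and running it requires a CUDA GPU plus a download of the model weights. Imports are placed inside the function so the module can be loaded without those dependencies.

```python
# Generation settings assumed from the diffusers Allegro example; tune as needed.
GENERATION_KWARGS = {
    "guidance_scale": 7.5,
    "max_sequence_length": 512,
    "num_inference_steps": 100,
}


def generate_video(prompt, out_path="output.mp4", cpu_offload=True, seed=42):
    """Generate one video with the diffusers Allegro pipeline and save it as MP4."""
    import torch
    from diffusers import AllegroPipeline, AutoencoderKLAllegro
    from diffusers.utils import export_to_video

    model_id = "rhymes-ai/Allegro"  # assumed Hugging Face model ID

    # The VAE is loaded in float32 and the rest of the pipeline in bfloat16.
    vae = AutoencoderKLAllegro.from_pretrained(
        model_id, subfolder="vae", torch_dtype=torch.float32
    )
    pipe = AllegroPipeline.from_pretrained(
        model_id, vae=vae, torch_dtype=torch.bfloat16
    )

    if cpu_offload:
        # Lower GPU memory use at the cost of slower inference.
        pipe.enable_sequential_cpu_offload()
    else:
        pipe.to("cuda")
    pipe.vae.enable_tiling()

    generator = torch.Generator(device="cuda").manual_seed(seed)
    frames = pipe(prompt, generator=generator, **GENERATION_KWARGS).frames[0]
    export_to_video(frames, out_path, fps=15)  # 15 FPS, as listed above
    return out_path
```

A call such as `generate_video("A seaside harbor at sunrise, boats rocking gently")` would write `output.mp4`. Passing `cpu_offload=False` keeps the whole pipeline on the GPU and needs correspondingly more memory.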

Maintenance & Community

The project is actively maintained with recent releases in late 2024 and early 2025. A Discord server is available for community support and discussion.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The model cannot render celebrities, legible text, specific locations, streets, or buildings. Output duration is limited to 6 seconds; Presto, the long-duration T2V model based on Allegro, targets longer videos.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 20 stars in the last 90 days
