Allegro by rhymes-ai

Text-to-video model for generating short, high-quality videos

Created 11 months ago
1,095 stars

Top 34.8% on SourcePulse

Project Summary

Allegro is a text-to-video (T2V) and text-image-to-video (TI2V) model capable of generating up to 6-second, 15 FPS, 720p videos from textual prompts. It offers a variant, Allegro-TI2V, which also accepts first-frame and optional last-frame image inputs for more controlled generation. The project is suitable for researchers and developers looking to integrate or fine-tune advanced video generation capabilities.

How It Works

Allegro utilizes a Diffusion Transformer (DiT) architecture, building upon the Open-Sora-Plan framework. It employs a VAE for latent space manipulation and a T5 text encoder for prompt understanding. The model generates videos by progressively denoising in the latent space, achieving high-quality outputs through its large parameter count (2.8B for DiT) and optimized inference pipeline.
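
At a high level, generation follows the standard latent-diffusion recipe described above. The sketch below is purely conceptual: the component names (text_encoder, dit, vae, scheduler) and the latent shape are illustrative placeholders, not Allegro's actual API.

```python
import torch

def generate_video(prompt, text_encoder, dit, vae, scheduler, num_steps=100):
    """Conceptual T2V loop: encode the prompt, denoise latents with the DiT, decode with the VAE."""
    # 1. Encode the prompt with the T5 text encoder.
    text_emb = text_encoder(prompt)

    # 2. Start from Gaussian noise in the VAE's spatio-temporal latent space
    #    (batch, latent frames, channels, height, width -- shape is illustrative).
    latents = torch.randn(1, 22, 4, 90, 160)

    # 3. Progressively denoise with the DiT, conditioned on the text embedding.
    for t in scheduler.timesteps(num_steps):
        noise_pred = dit(latents, t, text_emb)
        latents = scheduler.step(noise_pred, t, latents)

    # 4. Decode the denoised latents back into RGB video frames.
    return vae.decode(latents)
```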

Quick Start & Requirements

  • Install: pip install git+https://github.com/huggingface/diffusers.git (dev version)
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4.
  • Inference: Requires downloading model weights for the VAE, DiT, text encoder, and tokenizer; a minimal diffusers usage sketch follows this list.
  • Resources: Single-GPU memory usage is ~9.3 GB (BF16 with CPU offload) or ~27.5 GB (without offload). Inference time is ~20 min (single H100) or ~3 min (8x H100).
  • Links: Gallery, Hugging Face, Blog, Paper, Discord
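
Since Allegro is integrated into the Hugging Face diffusers library, a minimal text-to-video run can look like the sketch below. It assumes the rhymes-ai/Allegro checkpoint layout on the Hub and the AllegroPipeline API from a recent diffusers build; argument names and memory-saving options should be checked against the current docs.

```python
import torch
from diffusers import AllegroPipeline, AutoencoderKLAllegro
from diffusers.utils import export_to_video

# Load the VAE in fp32 and the rest of the pipeline in bf16 (assumed checkpoint layout).
vae = AutoencoderKLAllegro.from_pretrained(
    "rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32
)
pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_tiling()            # tile VAE decoding to reduce peak memory
# pipe.enable_model_cpu_offload()   # optional: trades speed for the lower memory footprint

prompt = "A seaside harbor at sunset, small boats bobbing on gentle waves."
video = pipe(prompt, guidance_scale=7.5, num_inference_steps=100).frames[0]
export_to_video(video, "output.mp4", fps=15)
```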

Highlighted Details

  • Generates 720p videos at 15 FPS for up to 6 seconds.
  • Allegro-TI2V variant supports image-to-video generation.
  • Full code for Presto, a long-duration T2V model based on Allegro, is released.
  • Training and fine-tuning code is available for both T2V and TI2V models.
  • Integrated into Hugging Face diffusers library.

Maintenance & Community

The project is actively maintained with recent releases in late 2024 and early 2025. A Discord server is available for community support and discussion.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The model cannot reliably render celebrities, legible text, specific locations, streets, or buildings. Output duration is currently limited to 6 seconds, though the related Presto model (whose code is released, as noted above) targets longer durations.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days
