Pyramid-Flow by jy0205

Video generation method based on flow matching

created 10 months ago
3,015 stars

Top 16.2% on sourcepulse

Project Summary

Pyramid Flow, presented at ICLR 2025, is a training-efficient autoregressive video generation method based on Flow Matching. It targets researchers and developers in generative modeling, enabling the creation of high-quality, long-duration videos (up to 10 seconds at 768p and 24 FPS) trained using only open-source datasets, with support for image-to-video generation.

How It Works

Pyramid Flow leverages Flow Matching to interpolate between latents of different resolutions and noise levels. This approach allows for simultaneous generation and decompression, improving computational efficiency compared to models operating solely at full resolution. The framework uses a single DiT (Diffusion Transformer) and is end-to-end optimized, achieving its performance targets within a reasonable training budget.
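The core idea, interpolating between latents at different resolutions and noise levels, can be sketched in toy form. This is an illustration only, not the paper's implementation: it assumes simple average-pooling for the resolution pyramid and the standard linear flow-matching path between noise and data.

```python
import numpy as np

def downsample(x, factor=2):
    """Average-pool a square latent by `factor` along each spatial axis."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def flow_match_point(x0, x1, t):
    """Linear flow-matching path: interpolate between noise x0 and data x1 at time t."""
    return (1.0 - t) * x0 + t * x1

rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 8))  # stand-in for a full-resolution video latent

# Build a resolution pyramid: earlier (noisier) stages work on coarser latents,
# so most denoising steps run at reduced resolution.
stages = [downsample(downsample(latent)), downsample(latent), latent]
for stage, x1 in enumerate(stages):
    x0 = rng.standard_normal(x1.shape)      # noise at this stage's resolution
    xt = flow_match_point(x0, x1, t=0.5)    # midpoint of this stage's flow
    print(f"stage {stage}: resolution {x1.shape}, latent mean {xt.mean():.3f}")
```

In the actual model a single DiT predicts the velocity along these per-stage flows, so generation and upsampling to the next resolution happen within one end-to-end optimized network.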

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (python==3.8.10, pytorch==2.1.2), and install requirements (pip install -r requirements.txt).
  • Model Download: Download checkpoints from Hugging Face (rain1011/pyramid-flow-miniflux or rain1011/pyramid-flow-sd3).
  • Inference: Run python app.py for a Gradio demo or use the provided inference code (video_generation_demo.ipynb).
  • Hardware: Supports multi-GPU inference and CPU offloading for low GPU memory usage (<8GB with model.enable_sequential_cpu_offload()). MPS backend is also supported for Apple Silicon.
  • Links: Paper, Project Page, miniFLUX Model, SD3 Model, Hugging Face Demo.
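The installation and inference steps above can be sketched as shell commands. The repo URL and environment name are assumptions inferred from the project name; the Python and PyTorch versions come from the requirements listed above.

```shell
# Assumed repo URL, based on the project/author names above.
git clone https://github.com/jy0205/Pyramid-Flow.git
cd Pyramid-Flow

# Conda environment with the versions stated in the README summary.
conda create -n pyramid python==3.8.10 -y
conda activate pyramid
pip install -r requirements.txt

# Launch the Gradio demo, or open video_generation_demo.ipynb for scripted inference.
python app.py
```

Checkpoints (rain1011/pyramid-flow-miniflux or rain1011/pyramid-flow-sd3) are downloaded from Hugging Face; on machines with less than 8 GB of GPU memory, the README notes that calling model.enable_sequential_cpu_offload() keeps inference within budget.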

Highlighted Details

  • Generates 10-second videos at 768p resolution and 24 FPS.
  • Supports image-to-video generation.
  • Achieves comparable performance to commercial models like Kling and Gen-3 Alpha on VBench.
  • Offers significant speedups with multi-GPU inference (e.g., 2.5 minutes for 5s, 768p, 24fps on 4 A100s).
  • Low GPU memory inference (<8GB) is possible via CPU offloading.

Maintenance & Community

The project is actively updated, with recent releases including a 768p miniFLUX checkpoint, training code for VAE and DiT finetuning, and improved inference capabilities. Community contributions are acknowledged, and a Hugging Face Space demo is available.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require clarification of the license.

Limitations & Caveats

The README notes that only the bf16 dtype is currently supported; fp16 is not yet supported for the model. Training the VAE and finetuning the DiT require substantial hardware resources (at least 8 A100 GPUs).

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 122 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (author of SGLang), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 1 more.

  • Open-Sora-Plan by PKU-YuanGroup: open-source project aiming to reproduce a Sora-like T2V model. Top 0.1%, 12k stars; created 1 year ago, updated 2 weeks ago.