Image generation research paper using visual autoregressive modeling
Top 6.3% on sourcepulse
VAR introduces a novel autoregressive approach to image generation, framing it as "next-scale prediction" rather than traditional raster-scan "next-token prediction." This method enables GPT-style models to achieve state-of-the-art results, surpassing diffusion models in image generation quality and exhibiting clear power-law scaling laws. It is designed for researchers and practitioners in computer vision and generative AI seeking efficient, high-quality image synthesis.
How It Works
VAR models images by predicting token maps at successively finer resolutions, effectively treating image generation as a sequence of scale predictions. This coarse-to-fine strategy allows for more scalable and efficient training of large autoregressive transformers, leading to improved performance and emergent scaling laws. The approach leverages transformer architectures, similar to those used in large language models, adapted for visual data.
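The coarse-to-fine idea can be sketched as follows. This is a minimal toy illustration, not the repo's actual implementation: a random-logit stand-in replaces the real transformer, and the conditioning on coarser maps that VAR performs is only noted in comments.

```python
import numpy as np

def next_scale_generate(scales=(1, 2, 4, 8), vocab=16, seed=0):
    """Toy sketch of next-scale prediction: at each step the 'model'
    emits a whole token map at the next, finer scale. The real VAR
    transformer conditions each step on all coarser maps; here a
    random-logit stand-in is used so the loop structure is visible."""
    rng = np.random.default_rng(seed)
    maps = []
    for s in scales:
        # Stand-in for transformer logits over the vocabulary at every
        # position of the s x s token map (conditioning on `maps` omitted).
        logits = rng.normal(size=(s, s, vocab))
        token_map = logits.argmax(axis=-1)  # greedy "sampling"
        maps.append(token_map)
    return maps  # coarse-to-fine sequence of token maps

maps = next_scale_generate()
print([m.shape for m in maps])  # (1, 1) up to (8, 8)
```

Note that each step predicts an entire token map in parallel, rather than one token at a time as in raster-scan autoregression; this is where the efficiency gain comes from.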
Quick Start & Requirements
Install dependencies with pip3 install -r requirements.txt. Installing flash-attn and xformers is recommended for accelerated attention computation. demo_sample.ipynb provides detailed usage examples.
Maintenance & Community
The project is actively maintained by FoundationVision, with significant community adoption evidenced by numerous third-party research projects and forks building on it. Further details and community interaction can be found via its GitHub repository.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration into closed-source projects.
Limitations & Caveats
Training scripts require a distributed training setup (torchrun) and significant computational resources, particularly for larger models. While standalone sampling scripts are mentioned as forthcoming, the provided demo_sample.ipynb offers immediate inference capabilities.
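A distributed launch would look roughly like the following; the script name and flags below are hypothetical placeholders, not the repo's actual entry point, so consult its documentation for the real command.

```shell
# Hypothetical example only: substitute the repo's actual training
# script and arguments from its documentation.
torchrun --nproc_per_node=8 train_var.py --depth=16 --batch_size=768
```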