VAR by FoundationVision

Image generation research paper using visual autoregressive modeling

created 1 year ago
8,338 stars

Top 6.3% on sourcepulse

View on GitHub
Project Summary

VAR introduces a novel autoregressive approach to image generation, framing it as "next-scale prediction" rather than the traditional raster-scan "next-token prediction." This formulation lets GPT-style models achieve state-of-the-art results, surpassing diffusion models in image generation quality while exhibiting clear power-law scaling laws. It is designed for researchers and practitioners in computer vision and generative AI who need efficient, high-quality image synthesis.

How It Works

VAR generates an image as a sequence of token maps at progressively finer resolutions: starting from a single coarse token map, each autoregressive step predicts the entire token map of the next, finer scale, conditioned on all coarser ones. This coarse-to-fine strategy makes training large autoregressive transformers more scalable and efficient, yielding improved performance and emergent scaling laws. The architecture is a standard transformer, similar to those used in large language models, adapted for visual data.
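To make the mechanism concrete, the sketch below walks the coarse-to-fine decoding loop in plain PyTorch. It is illustrative only: dummy_transformer, the codebook, and the scale schedule are placeholder assumptions, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

scales = [1, 2, 4, 8, 16]           # token-map side lengths, coarse to fine
vocab, dim = 4096, 512              # codebook size / width (assumed values)
codebook = torch.randn(vocab, dim)  # stand-in for the VQ-VAE codebook

def dummy_transformer(prev_tokens, cond):
    # Placeholder: a real model attends over all coarser-scale tokens and
    # the condition; here we just emit random logits of the right shape.
    s = scales[len(prev_tokens)]
    return torch.randn(1, s * s, vocab)

def decode_coarse_to_fine(transformer, cond=None):
    """One forward pass per scale; each pass predicts a whole token map."""
    prev_tokens = []
    feat = torch.zeros(1, dim, scales[-1], scales[-1])  # accumulated feature
    for s in scales:
        logits = transformer(prev_tokens, cond)  # (1, s*s, vocab)
        ids = logits.argmax(dim=-1)              # greedy pick, for brevity
        prev_tokens.append(ids)
        z = codebook[ids]                        # (1, s*s, dim)
        z = z.transpose(1, 2).reshape(1, dim, s, s)
        feat = feat + F.interpolate(z, size=scales[-1], mode="bicubic")
    return feat  # a VQ-VAE decoder would map this back to pixels

feat = decode_coarse_to_fine(dummy_transformer)
print(feat.shape)  # torch.Size([1, 512, 16, 16])
```

Note how each step emits s*s tokens in one shot rather than one token at a time, which is where the efficiency over raster-scan autoregression comes from.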

Quick Start & Requirements

  • Installation: install PyTorch (>=2.0.0) and the remaining dependencies via pip3 install -r requirements.txt.
  • Data: requires the ImageNet dataset organized into class-specific subdirectories (see the sanity-check snippet after this list).
  • Optional: flash-attn and xformers for accelerated attention computation.
  • Demo: an interactive demo is available at https://opensource.bytedance.com/gmpt/t2i/invite.
  • Code: demo_sample.ipynb provides detailed usage examples.
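The class-subdirectory layout is exactly what torchvision's ImageFolder expects, so a quick sanity check is straightforward; the path below is a placeholder.

```python
# Sanity-check the ImageNet layout expected by the training scripts:
# one subdirectory per class. The path is a placeholder.
from torchvision import datasets

train_set = datasets.ImageFolder(root="/path/to/imagenet/train")
print(f"{len(train_set.classes)} classes, {len(train_set)} images")

# Optional accelerated attention: check whether flash-attn is importable.
try:
    import flash_attn  # noqa: F401
    print("flash-attn available")
except ImportError:
    print("flash-attn not installed; standard attention will be used")
```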

Highlighted Details

  • Won the NeurIPS 2024 Best Paper Award.
  • Outperforms diffusion models on image generation benchmarks.
  • Demonstrates clear power-law scaling for autoregressive visual models.
  • Offers zero-shot generalization across various downstream tasks.
  • Pre-trained models available up to 2.0B parameters (VAR-d30) and 512x512 resolution (VAR-d36); see the download sketch after this list.
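The pre-trained weights are distributed via Hugging Face Hub; below is a minimal download sketch, assuming the FoundationVision/var repo id and the var_d30.pth filename (verify both against the README's model table before relying on them).

```python
# Download a pre-trained checkpoint from Hugging Face Hub. The repo id and
# filename are assumptions taken from the project's model listing; verify
# them against the README before use.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="FoundationVision/var",
                            filename="var_d30.pth")
print("checkpoint cached at:", ckpt_path)
```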

Maintenance & Community

The project is actively maintained by FoundationVision, with significant community adoption evidenced by the numerous third-party research projects and forks listed in its README. Further details and community interaction can be found via the GitHub repository.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

Training requires a distributed setup (torchrun) and significant computational resources, particularly for the larger models; a rough memory estimate is sketched below. Some sampling scripts are noted as forthcoming, but the provided demo_sample.ipynb already offers immediate inference.
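For a sense of scale, here is back-of-envelope arithmetic for the optimizer state of the 2.0B-parameter VAR-d30 under an assumed mixed-precision Adam setup; every constant is an assumption, and activation memory comes on top.

```python
# Rough memory arithmetic for training the 2.0B-parameter VAR-d30 with a
# mixed-precision Adam setup. Every constant here is an assumption, and
# activation/batch memory would add substantially on top.
params = 2.0e9
weights_fp16 = params * 2      # 2 bytes per fp16 weight
grads_fp16 = params * 2        # fp16 gradients
adam_states = params * 4 * 2   # fp32 exp_avg + exp_avg_sq
master_fp32 = params * 4       # fp32 master copy of the weights
total_gb = (weights_fp16 + grads_fp16 + adam_states + master_fp32) / 1024**3
print(f"~{total_gb:.0f} GB of state before activations")  # roughly 30 GB
```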

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 639 stars in the last 90 days

Explore Similar Projects

guided-diffusion by openai
Image synthesis codebase for diffusion models
Top 0.2% on sourcepulse · 7k stars · created 4 years ago · updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 3 more.

taming-transformers by CompVis
Image synthesis research paper using transformers
Top 0.1% on sourcepulse · 6k stars · created 4 years ago · updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI researcher at UC Berkeley), and 4 more.

stablediffusion by Stability-AI
Latent diffusion model for high-resolution image synthesis
Top 0.1% on sourcepulse · 41k stars · created 2 years ago · updated 1 month ago
Starred by Chip Huyen (Author of AI Engineering and Designing Machine Learning Systems), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 12 more.