VAR by FoundationVision

Image generation research paper using visual autoregressive modeling

Created 1 year ago
8,395 stars

Top 6.1% on SourcePulse

View on GitHub
Project Summary

VAR introduces a novel autoregressive approach to image generation, framing it as "next-scale prediction" rather than traditional raster-scan "next-token prediction." This method enables GPT-style models to achieve state-of-the-art results, surpassing diffusion models in image generation quality and demonstrating discoverable power-law scaling. It is designed for researchers and practitioners in computer vision and generative AI seeking efficient and high-quality image synthesis.

How It Works

VAR models images by predicting token maps at successively finer resolutions, treating image generation as a coarse-to-fine sequence of scale predictions: each autoregressive step produces an entire higher-resolution token map conditioned on all coarser ones. This strategy allows for more scalable and efficient training of large autoregressive transformers, leading to improved performance and emergent scaling laws. The approach uses transformer architectures similar to those in large language models, adapted for visual data.
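The coarse-to-fine loop described above can be sketched in miniature. Everything below is illustrative, not VAR's actual API: the scale schedule, vocabulary size, and the random "predict" step are toy stand-ins for the trained transformer.

```python
import numpy as np

# Toy sketch of next-scale prediction: generate an image as a sequence of
# token maps at increasing resolutions (1x1 -> 2x2 -> 4x4 -> 8x8).
# The schedule and the "model" are hypothetical stand-ins, NOT VAR's code.

SCALES = [1, 2, 4, 8]  # side lengths of successive token maps
VOCAB = 16             # size of the discrete token vocabulary

def upsample(tokens, size):
    """Nearest-neighbor upsample a token map to the next resolution."""
    reps = size // tokens.shape[0]
    return np.repeat(np.repeat(tokens, reps, axis=0), reps, axis=1)

def predict_next_scale(context, size, rng):
    """Stand-in for the transformer: sample a size x size token map,
    conditioned (trivially, here) on the upsampled coarser context."""
    coarse = upsample(context, size) if context is not None else 0
    noise = rng.integers(0, VOCAB, size=(size, size))
    return (coarse + noise) % VOCAB

def generate(rng):
    """Autoregress over scales: each step emits one whole token map."""
    maps, context = [], None
    for s in SCALES:
        context = predict_next_scale(context, s, rng)
        maps.append(context)
    return maps  # the final map is the full-resolution token image

maps = generate(np.random.default_rng(0))
print([m.shape for m in maps])  # [(1, 1), (2, 2), (4, 4), (8, 8)]
```

The key contrast with raster-scan autoregression is visible in the loop: one forward step per scale (4 steps here) rather than one step per token (85 tokens across these four maps).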

Quick Start & Requirements

  • Installation: Install PyTorch (>=2.0.0) and other dependencies via pip3 install -r requirements.txt.
  • Data: Requires ImageNet dataset organized into class-specific subdirectories.
  • Optional: flash-attn and xformers for accelerated attention computation.
  • Demo: Interactive demo available at https://opensource.bytedance.com/gmpt/t2i/invite.
  • Code: demo_sample.ipynb provides detailed usage examples.

Highlighted Details

  • Achieved NeurIPS 2024 Best Paper Award.
  • Outperforms diffusion models in image generation benchmarks.
  • Demonstrates clear power-law scaling for autoregressive visual models.
  • Offers zero-shot generalizability across various tasks.
  • Pre-trained models available up to 2.0B parameters (VAR-d30) and 512x512 resolution (VAR-d36).
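The power-law scaling bullet means test loss falls roughly as a power of model size, L ≈ a·N^b, which appears as a straight line on a log-log plot. A minimal sketch of recovering such an exponent with a log-log least-squares fit (the numbers are synthetic, not VAR's reported results):

```python
import math

# Synthetic (parameter_count, loss) pairs following loss = a * N^b exactly;
# real scaling-law data would be noisy measurements from trained models.
a, b = 5.0, -0.08
sizes = [3e8, 6e8, 1e9, 2e9]
losses = [a * n**b for n in sizes]

# A power law is linear in log-log space: log L = log a + b * log N.
xs = [math.log(n) for n in sizes]
ys = [math.log(l) for l in losses]
k = len(xs)
mx, my = sum(xs) / k, sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

print(round(slope, 3), round(math.exp(intercept), 3))  # recovers b and a
```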

Maintenance & Community

The project is actively maintained by FoundationVision, with significant community adoption evidenced by numerous third-party research projects and forks. Further details and community interaction can be found via the GitHub repository.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

Training scripts require a distributed training setup (torchrun) and significant computational resources, particularly for larger models. Although standalone sampling scripts are described as forthcoming, demo_sample.ipynb already supports inference.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 58 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 15 more.

taming-transformers by CompVis

0.1%
6k stars
Image synthesis research paper using transformers
Created 4 years ago; updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), David Ha (Cofounder of Sakana AI), and 18 more.

dalle-mini by borisdayma

0.0%
15k stars
Text-to-image model for generating images from text prompts
Created 4 years ago; updated 1 year ago