LlamaGen  by FoundationVision

Llama-style autoregressive models for image generation (research codebase)

created 1 year ago
1,818 stars

Top 24.3% on sourcepulse

Project Summary

LlamaGen offers a novel approach to image generation by adapting the autoregressive "next-token prediction" paradigm from Large Language Models (LLMs) to visual data. This method aims to achieve state-of-the-art performance through proper scaling, targeting researchers and developers interested in LLM-based generative models.

How It Works

LlamaGen uses a VQ-VAE to tokenize images into discrete visual tokens, which a Llama-style autoregressive model then generates via next-token prediction. This approach eschews the inductive biases common in diffusion models, relying solely on scaling and next-token prediction for image synthesis. The project provides two image tokenizers (downsample ratios 16 and 8) and a range of autoregressive models from 100M to 3.1B parameters for both class-conditional and text-conditional generation.
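The two-stage pipeline can be illustrated with a minimal toy sketch. Everything here is a stand-in (hypothetical names, shapes, and a dummy sampler): the real repo uses a trained VQ-VAE encoder for stage 1 and a Llama-style Transformer for stage 2.

```python
import random

# Stage 1 (toy): a VQ "tokenizer" maps each patch vector to the index of its
# nearest codebook entry, turning an image into a sequence of discrete tokens.
def quantize(patches, codebook):
    tokens = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Stage 2 (toy): an autoregressive model predicts the next visual token from
# the prefix; a random-logits lambda stands in for the Llama-style Transformer.
def generate(model, num_tokens, vocab_size, cond_token=0):
    seq = [cond_token]                          # e.g. a class-label token
    for _ in range(num_tokens):
        logits = model(seq)                     # next-token scores
        seq.append(max(range(vocab_size), key=lambda i: logits[i]))
    return seq[1:]                              # drop the conditioning token

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, 0.2], [0.9, 1.1]]
tokens = quantize(patches, codebook)            # -> [0, 1]

random.seed(0)
dummy_model = lambda seq: [random.random() for _ in range(len(codebook))]
sampled = generate(dummy_model, num_tokens=4, vocab_size=len(codebook))
```

In the real system the sampled token sequence is decoded back to pixels by the VQ-VAE decoder; this sketch only shows the tokenize-then-predict loop.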

Quick Start & Requirements

  • Installation: PyTorch (>=2.1.0) is required. Installation and training details are in GETTING_STARTED.md.
  • Pre-trained Models: Download weights from Hugging Face links provided in the README.
  • Demo: Run python3 autoregressive/sample/sample_c2i.py or sample_t2i.py after downloading models. A Gradio demo is also available via app.py.
  • Serving: vLLM integration is supported for faster inference.
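The quick-start steps above can be scripted. The sketch below is a hypothetical convenience wrapper: the script paths come from the README, but the wrapper itself and any flags you pass are assumptions, not part of the repo.

```python
import sys

def build_sample_cmd(mode="c2i", extra_args=()):
    """Build the command line for the class- or text-conditional demo script."""
    script = {
        "c2i": "autoregressive/sample/sample_c2i.py",   # class-conditional
        "t2i": "autoregressive/sample/sample_t2i.py",   # text-conditional
    }[mode]
    return [sys.executable, script, *extra_args]

# Build (but don't launch) the class-conditional sampling command.
cmd = build_sample_cmd("c2i")
# To actually run it from the repo root after downloading the weights:
# import subprocess; subprocess.run(cmd, check=True)
```

Run this from the repository root after downloading the pre-trained weights, as the scripts expect local checkpoint files.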

Highlighted Details

  • Class-conditional models achieve FID scores as low as 2.18 on ImageNet (256x256); the image tokenizer reaches a reconstruction FID (rFID) of 0.59.
  • Includes text-conditional models trained on LAION COCO and internal datasets.
  • Supports vLLM serving with a reported 300%-400% inference speedup.
  • Provides models ranging from 100M to 3.1B parameters.

Maintenance & Community

The project is associated with HKU and ByteDance. Updates are frequent, with recent releases including image tokenizers, AR models, and vLLM support. Links to an online demo and the project page are provided.

Licensing & Compatibility

The majority of the project is licensed under the MIT License, though portions derived from referenced projects may be under their own licenses. This generally allows commercial use and linking with closed-source software.

Limitations & Caveats

Text-conditional models require an additional language-model setup, detailed in language/README.md. While vLLM integration is noted, specific hardware requirements for optimal serving performance are not documented.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 97 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago, updated 11 months ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 3 more.

guided-diffusion by openai

Top 0.2% · 7k stars
Image synthesis codebase for diffusion models
created 4 years ago, updated 1 year ago