LlamaGen  by FoundationVision

Llama-style autoregressive models for image generation (research codebase)

created 1 year ago
1,818 stars

Top 24.3% on sourcepulse

Project Summary

LlamaGen offers a novel approach to image generation by adapting the autoregressive "next-token prediction" paradigm from Large Language Models (LLMs) to visual data. This method aims to achieve state-of-the-art performance through proper scaling, targeting researchers and developers interested in LLM-based generative models.

How It Works

LlamaGen uses a VQ-VAE to tokenize images into discrete visual tokens, which a Llama-style autoregressive model then generates via next-token prediction. This approach eschews the inductive biases common in diffusion models, relying solely on scaling and next-token prediction for image synthesis. The project provides two image tokenizers (downsample ratios 16 and 8) and a range of autoregressive models from 100M to 3.1B parameters for both class-conditional and text-conditional generation.
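The two-stage pipeline can be illustrated with a minimal toy sketch. Everything here is a stand-in (hypothetical names, shapes, and a dummy sampler): the real repo uses a trained VQ-VAE encoder for stage 1 and a Llama-style Transformer for stage 2.

```python
import random

# Stage 1 (toy): a VQ "tokenizer" maps each patch vector to the index of its
# nearest codebook entry, turning an image into a sequence of discrete tokens.
def quantize(patches, codebook):
    tokens = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Stage 2 (toy): an autoregressive model predicts the next visual token from
# the prefix; a random-logits lambda stands in for the Llama-style Transformer.
def generate(model, num_tokens, vocab_size, cond_token=0):
    seq = [cond_token]                          # e.g. a class-label token
    for _ in range(num_tokens):
        logits = model(seq)                     # next-token scores
        seq.append(max(range(vocab_size), key=lambda i: logits[i]))
    return seq[1:]                              # drop the conditioning token

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, 0.2], [0.9, 1.1]]
tokens = quantize(patches, codebook)            # -> [0, 1]

random.seed(0)
dummy_model = lambda seq: [random.random() for _ in range(len(codebook))]
sampled = generate(dummy_model, num_tokens=4, vocab_size=len(codebook))
```

In the real system the sampled token sequence is decoded back to pixels by the VQ-VAE decoder; this sketch only shows the tokenize-then-predict loop.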

Quick Start & Requirements

  • Installation: PyTorch (>=2.1.0) is required. Installation and training details are in GETTING_STARTED.md.
  • Pre-trained Models: Download weights from Hugging Face links provided in the README.
  • Demo: Run python3 autoregressive/sample/sample_c2i.py or sample_t2i.py after downloading models. A Gradio demo is also available via app.py.
  • Serving: vLLM integration is supported for faster inference.
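The quick-start steps above can be scripted. The sketch below is a hypothetical convenience wrapper: the script paths come from the README, but the wrapper itself and any flags you pass are assumptions, not part of the repo.

```python
import sys

def build_sample_cmd(mode="c2i", extra_args=()):
    """Build the command line for the class- or text-conditional demo script."""
    script = {
        "c2i": "autoregressive/sample/sample_c2i.py",   # class-conditional
        "t2i": "autoregressive/sample/sample_t2i.py",   # text-conditional
    }[mode]
    return [sys.executable, script, *extra_args]

# Build (but don't launch) the class-conditional sampling command.
cmd = build_sample_cmd("c2i")
# To actually run it from the repo root after downloading the weights:
# import subprocess; subprocess.run(cmd, check=True)
```

Run this from the repository root after downloading the pre-trained weights, as the scripts expect local checkpoint files.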

Highlighted Details

  • Class-conditional models achieve FID scores as low as 2.18 on ImageNet (256x256); the image tokenizer reaches a reconstruction FID (rFID) of 0.59.
  • Includes text-conditional models trained on LAION COCO and internal datasets.
  • Supports vLLM serving with a reported 300%-400% inference speedup.
  • Provides models ranging from 100M to 3.1B parameters.

Maintenance & Community

The project is associated with HKU and ByteDance. Updates are frequent, with recent releases including image tokenizers, AR models, and vLLM support. Links to an online demo and the project page are provided.

Licensing & Compatibility

The majority of the project is licensed under the MIT License, though portions derived from referenced projects may be under their own licenses. This generally allows commercial use and linking with closed-source software.

Limitations & Caveats

Text-conditional models require an additional language-model setup, detailed in language/README.md. While vLLM integration is noted, specific hardware requirements for optimal serving performance are not documented.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 97 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago, updated 11 months ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 3 more.

guided-diffusion by openai

Top 0.2% · 7k stars
Image synthesis codebase for diffusion models
created 4 years ago, updated 1 year ago