LlamaGen: Llama-based autoregressive image generation (research paper and code)
Top 24.3% on sourcepulse
LlamaGen offers a novel approach to image generation by adapting the autoregressive "next-token prediction" paradigm from Large Language Models (LLMs) to visual data. This method aims to achieve state-of-the-art performance through proper scaling, targeting researchers and developers interested in LLM-based generative models.
How It Works
LlamaGen utilizes a VQ-VAE to tokenize images into discrete visual tokens, which are then processed by a Llama-like autoregressive model. This approach eschews the inductive biases common in diffusion models, relying solely on scaling and next-token prediction for image synthesis. The project provides two image tokenizers (downsample ratios 16 and 8) and a range of autoregressive models from 100M to 3B parameters for both class-conditional and text-conditional generation.
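The two-stage pipeline can be pictured with a short sketch. The names below (vq_tokenizer, ar_model, and their methods) are illustrative assumptions rather than the repository's actual API; the point is only to show how discrete visual tokens flow through next-token prediction and back to pixels.

```python
# Conceptual sketch of the two-stage pipeline (hypothetical names,
# not the repository's actual API).
import torch

def generate_image(vq_tokenizer, ar_model, cond, num_tokens=256, temperature=1.0):
    """Autoregressively sample visual tokens, then decode them to pixels.

    vq_tokenizer : image tokenizer with a discrete codebook and a decode() method
    ar_model     : Llama-style transformer returning next-token logits
    cond         : conditioning token ids (class label or text tokens), shape (B, L)
    num_tokens   : number of visual tokens to sample, e.g. (256 / 16) ** 2 = 256
                   for a 256x256 image with downsample ratio 16
    """
    tokens = cond.clone()                        # start from the conditioning prefix
    for _ in range(num_tokens):
        logits = ar_model(tokens)[:, -1, :]      # logits for the next visual token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    visual_tokens = tokens[:, cond.shape[1]:]    # drop the conditioning prefix
    return vq_tokenizer.decode(visual_tokens)    # codebook indices -> image
```

In practice the released models add refinements such as classifier-free guidance and cached attention during sampling, but the loop above captures the core next-token-prediction idea.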
Quick Start & Requirements
Setup instructions are provided in GETTING_STARTED.md. After downloading the pretrained models, run python3 autoregressive/sample/sample_c2i.py for class-conditional sampling or sample_t2i.py for text-conditional sampling. A Gradio demo is also available via app.py.
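A minimal command sequence is sketched below, assuming the checkpoints have already been downloaded as described in GETTING_STARTED.md; the scripts take checkpoint and model-size arguments that are omitted here, so consult the repository documentation for the exact flags.

```bash
# Class-conditional sampling
python3 autoregressive/sample/sample_c2i.py

# Text-conditional sampling
python3 autoregressive/sample/sample_t2i.py

# Local Gradio demo (assumed entry point)
python3 app.py
```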
Highlighted Details
Maintenance & Community
The project is associated with HKU and ByteDance. Updates are frequent, with recent releases including image tokenizers, AR models, and vLLM support. Links to an online demo and the project page are provided.
Licensing & Compatibility
The majority of the project is licensed under the MIT License, although portions of the code may fall under the separate licenses of the projects they are derived from. This generally allows for commercial use and linking with closed-source software.
Limitations & Caveats
The text-conditional models require additional language-model setup, as detailed in language/README.md. While vLLM integration is noted, specific hardware requirements for optimal performance are not detailed.