hart by mit-han-lab

Visual generation model using a hybrid autoregressive transformer

created 9 months ago
621 stars

Top 53.9% on sourcepulse

Project Summary

HART (Hybrid Autoregressive Transformer) is a visual generation model designed to produce high-quality 1024x1024 images, aiming to rival diffusion models in quality and efficiency. It targets researchers and practitioners in generative AI seeking faster, more efficient image synthesis.

How It Works

HART uses a hybrid tokenizer that decomposes continuous image latents into discrete tokens, which capture global structure, and continuous residual tokens, which carry the detail the discrete tokens cannot represent. A scalable-resolution discrete autoregressive model handles the discrete component, while a lightweight diffusion module (37M parameters) models the residuals. Compared with discrete-only tokenization, this hybrid approach improves reconstruction quality and enables efficient generation of high-resolution images.
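
The discrete-plus-residual split described above can be illustrated with a toy example. This is a minimal sketch, not the repository's tokenizer: it assumes a single-level nearest-neighbor codebook lookup, and the names hybrid_tokenize, hybrid_detokenize, and the tiny two-entry codebook are hypothetical. It only shows why keeping the continuous residual lets the original latent be recovered exactly, whereas the discrete index alone cannot.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def hybrid_tokenize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Split each latent into a discrete index and a continuous residual."""
    indices: List[int] = []
    residuals: List[List[float]] = []
    for vec in latents:
        idx = nearest_code(vec, codebook)
        indices.append(idx)
        # The residual is whatever the chosen codebook entry fails to capture.
        residuals.append([x - c for x, c in zip(vec, codebook[idx])])
    return indices, residuals


def hybrid_detokenize(
    indices: Sequence[int],
    residuals: Sequence[Vector],
    codebook: Sequence[Vector],
) -> List[List[float]]:
    """Rebuild latents as codebook entry plus residual."""
    return [
        [c + r for c, r in zip(codebook[i], res)]
        for i, res in zip(indices, residuals)
    ]


# Hypothetical toy data: two codebook entries, two 2-D latents.
codebook = [[0.0, 0.0], [1.0, 1.0]]
latents = [[0.75, 1.25], [0.25, -0.5]]

indices, residuals = hybrid_tokenize(latents, codebook)
reconstructed = hybrid_detokenize(indices, residuals, codebook)
print(indices)
print(residuals)
print(reconstructed == latents)
```

In HART the discrete indices would be predicted by the autoregressive model and the residuals by the diffusion module; here both are simply computed from the known latents to show the decomposition.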

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment (python=3.10), install the CUDA toolkit, and run pip install -e . from the repository root.
  • Prerequisites: CUDA toolkit, Python 3.10. Requires downloading pre-trained models for HART (0.7b-1024px), Qwen2-VL-1.5B-Instruct, and optionally ShieldGemma-2B for safety.
  • Demo: Launch with python app.py --model_path /path/to/model --text_model_path /path/to/Qwen2 --shield_model_path /path/to/ShieldGemma2B.
  • Inference: Use sample.py for single or multiple prompts.
  • Docs: Paper, Demo, Project.
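
The demo launch line above can also be assembled programmatically, which helps when the three model paths come from configuration. The sketch below is an assumption-laden convenience, not part of the repository: build_demo_command is a hypothetical helper, and the only things taken from the summary are the script name app.py and the three flags. The paths are the same placeholders used above.

```python
import shlex
from typing import List


def build_demo_command(
    model_path: str, text_model_path: str, shield_model_path: str
) -> List[str]:
    """Build the argv list for the demo launch shown in the Quick Start."""
    return [
        "python", "app.py",
        "--model_path", model_path,
        "--text_model_path", text_model_path,
        "--shield_model_path", shield_model_path,
    ]


# Placeholder paths; replace with the locations of the downloaded models.
cmd = build_demo_command(
    "/path/to/model", "/path/to/Qwen2", "/path/to/ShieldGemma2B"
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root once the models are in place.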

Highlighted Details

  • Achieves FID of 5.38 on MJHQ-30K, a 31% improvement over VAR.
  • Outperforms diffusion models in both FID and CLIP score.
  • Offers 4.5-7.7x higher throughput and 6.9-13.4x lower MACs (multiply-accumulate operations) than those diffusion models.
  • Hybrid tokenizer improves reconstruction FID from 2.11 to 0.30.
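
Because lower FID is better, the figures above are easiest to compare as relative reductions. The short sketch below works through that arithmetic using only the numbers listed here; the helper names are hypothetical, and the implied baseline is approximate because the 31% figure is rounded.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a lower-is-better metric dropped."""
    return (before - after) / before


def implied_baseline(after: float, reduction: float) -> float:
    """Baseline value implied by a final value and a relative reduction."""
    return after / (1.0 - reduction)


# Reconstruction FID: 2.11 (discrete-only) to 0.30 (hybrid tokenizer).
recon_gain = relative_reduction(2.11, 0.30)
print(f"reconstruction FID reduction: {recon_gain:.1%}")

# Generation FID: 5.38 after a stated 31% improvement.
baseline = implied_baseline(5.38, 0.31)
print(f"implied baseline generation FID: {baseline:.2f}")
```

So the tokenizer change removes roughly 86% of the reconstruction FID, and the 31% generation improvement implies a baseline FID of about 7.8.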

Maintenance & Community

The project comes from mit-han-lab. The codebase is inspired by VAR and MAR. Key contributors include Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, and Song Han.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify any limitations or known issues. The project appears to be recent (2024).

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 51 stars in the last 90 days
