UniTok by FoundationVision

Unified tokenizer for visual generation and understanding research

created 5 months ago
368 stars

Top 77.8% on sourcepulse

Project Summary

UniTok provides a unified visual tokenizer designed for both image generation and understanding tasks, targeting researchers and developers building multimodal large language models (MLLMs). It enables seamless integration with autoregressive generative models and multimodal understanding models, offering a single tokenization solution for diverse MLLM architectures.

How It Works

UniTok trains a single discrete visual tokenizer whose tokens are optimized jointly for generation (faithful reconstruction) and understanding (semantic alignment). Benchmarks show UniTok achieves a significantly lower rFID score (0.38) compared to other methods like VILA-U (1.80) and VAR (0.90), indicating superior image reconstruction quality. For understanding tasks, it also demonstrates strong performance, outperforming VILA-U on several benchmarks when integrated into MLLMs.
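The joint objective can be illustrated with a toy sketch: a single codebook quantizes image features, and the same discrete tokens are supervised by both a reconstruction loss (for generation) and a text-alignment loss (for understanding). The code below is a minimal, self-contained illustration of that idea in plain PyTorch; the class and layer names are hypothetical, and it is not UniTok's actual architecture or training code.

```python
# Toy sketch (not UniTok's code): one discrete tokenizer, two objectives.
import torch
import torch.nn.functional as F

class ToyUnifiedTokenizer(torch.nn.Module):
    def __init__(self, codebook_size=4096, dim=64):
        super().__init__()
        self.encoder = torch.nn.Conv2d(3, dim, kernel_size=8, stride=8)    # image -> latent grid
        self.codebook = torch.nn.Embedding(codebook_size, dim)             # discrete visual vocabulary
        self.decoder = torch.nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8)
        self.proj = torch.nn.Linear(dim, dim)                              # projection for text alignment

    def forward(self, images, text_emb):
        z = self.encoder(images)                                   # (B, D, H/8, W/8)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])       # (B*H*W, D)
        ids = torch.cdist(flat, self.codebook.weight).argmin(dim=1)  # nearest codebook entry
        q = self.codebook(ids).reshape(z.shape[0], z.shape[2], z.shape[3], -1).permute(0, 3, 1, 2)
        q = z + (q - z).detach()                                   # straight-through estimator
        recon = self.decoder(q)                                    # tokens drive generation ...
        img_emb = self.proj(q.mean(dim=(2, 3)))                    # ... and a global embedding for understanding
        rec_loss = F.mse_loss(recon, images)                       # generation objective
        align_loss = 1 - F.cosine_similarity(img_emb, text_emb).mean()  # understanding objective
        return rec_loss + align_loss, ids

# Example: loss, token_ids = ToyUnifiedTokenizer()(torch.randn(2, 3, 64, 64), torch.randn(2, 64))
```

The point of the sketch is only that one set of discrete tokens can carry both signals; UniTok realizes this with a far more capable encoder, decoder, and training recipe.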

Quick Start & Requirements

  • Installation: Clone the repository and install requirements via pip install -r requirements.txt.
  • Prerequisites: Python ≥ 3.10, PyTorch ≥ 2.3.1.
  • Inference: Download a checkpoint, then run python inference.py --ckpt_path <path_to_tokenizer.pth> --src_img <path_to_image> --rec_img <path_to_reconstructed_image> (see the batch sketch after this list).
  • Training: Requires DataComp-1B dataset, external models for loss calculation, and ImageNet validation set. Configuration for distributed training is available in launch.sh.
  • Demo: A Gradio demo is available on Hugging Face.
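To reconstruct a folder of images, the documented inference.py command can be wrapped in a short script. The sketch below uses only the flags shown above; the checkpoint and directory names are placeholders to adapt to your setup.

```python
# Batch reconstruction by wrapping the documented inference.py CLI.
import subprocess
from pathlib import Path

CKPT = "unitok_tokenizer.pth"     # downloaded tokenizer checkpoint (placeholder name)
SRC_DIR = Path("images")          # folder of source images (placeholder)
OUT_DIR = Path("reconstructions") # where reconstructed images are written
OUT_DIR.mkdir(exist_ok=True)

for src in sorted(SRC_DIR.glob("*.jpg")):
    rec = OUT_DIR / src.name
    subprocess.run(
        ["python", "inference.py",
         "--ckpt_path", CKPT,
         "--src_img", str(src),
         "--rec_img", str(rec)],
        check=True,  # stop on the first failed reconstruction
    )
```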

Highlighted Details

  • Achieves rFID of 0.38 and 78.6% accuracy on ImageNet zero-shot classification.
  • Outperforms VILA-U on VQA, GQA, TextVQA, POPE, MME, and MM-Vet benchmarks when used with Llama-2-7B.
  • Demonstrates strong generation performance, with scores of 0.76 for Count, 0.79 for Differ, and 0.74 for Logical on GenAI-Bench.
  • Compatible with frameworks like LlamaGen, LLaVA, Chameleon, and Liquid.

Maintenance & Community

The project is led by Chuofan Ma and Xiaojuan Qi, with contributions from researchers at HKU and ByteDance. Model weights and a Gradio demo are available on Hugging Face.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Training requires significant data preparation for DataComp-1B and specific configurations for distributed training. The README also notes an initialization trade-off: random initialization improves downstream understanding performance, while CLIP weight initialization boosts zero-shot classification accuracy, so the better choice depends on the use case.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 101 stars in the last 90 days
