UniTok by FoundationVision

Unified tokenizer for visual generation and understanding research

created 5 months ago
368 stars

Top 77.8% on sourcepulse

Project Summary

UniTok provides a unified visual tokenizer designed for both image generation and understanding tasks, targeting researchers and developers building multimodal large language models (MLLMs). It enables seamless integration with autoregressive generative models and multimodal understanding models, offering a single tokenization solution for diverse MLLM architectures.

How It Works

UniTok trains a single discrete visual tokenizer whose tokens are optimized jointly for generation (faithful reconstruction) and understanding (semantic alignment). Benchmarks show UniTok achieves a significantly lower rFID score (0.38) compared to other methods like VILA-U (1.80) and VAR (0.90), indicating superior image reconstruction quality. For understanding tasks, it also demonstrates strong performance, outperforming VILA-U on several benchmarks when integrated into MLLMs.
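The joint objective can be illustrated with a toy sketch: a single codebook quantizes image features, and the same discrete tokens are supervised by both a reconstruction loss (for generation) and a text-alignment loss (for understanding). The code below is a minimal, self-contained illustration of that idea in plain PyTorch; the class and layer names are hypothetical, and it is not UniTok's actual architecture or training code.

```python
# Toy sketch (not UniTok's code): one discrete tokenizer, two objectives.
import torch
import torch.nn.functional as F

class ToyUnifiedTokenizer(torch.nn.Module):
    def __init__(self, codebook_size=4096, dim=64):
        super().__init__()
        self.encoder = torch.nn.Conv2d(3, dim, kernel_size=8, stride=8)    # image -> latent grid
        self.codebook = torch.nn.Embedding(codebook_size, dim)             # discrete visual vocabulary
        self.decoder = torch.nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8)
        self.proj = torch.nn.Linear(dim, dim)                              # projection for text alignment

    def forward(self, images, text_emb):
        z = self.encoder(images)                                   # (B, D, H/8, W/8)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])       # (B*H*W, D)
        ids = torch.cdist(flat, self.codebook.weight).argmin(dim=1)  # nearest codebook entry
        q = self.codebook(ids).reshape(z.shape[0], z.shape[2], z.shape[3], -1).permute(0, 3, 1, 2)
        q = z + (q - z).detach()                                   # straight-through estimator
        recon = self.decoder(q)                                    # tokens drive generation ...
        img_emb = self.proj(q.mean(dim=(2, 3)))                    # ... and a global embedding for understanding
        rec_loss = F.mse_loss(recon, images)                       # generation objective
        align_loss = 1 - F.cosine_similarity(img_emb, text_emb).mean()  # understanding objective
        return rec_loss + align_loss, ids

# Example: loss, token_ids = ToyUnifiedTokenizer()(torch.randn(2, 3, 64, 64), torch.randn(2, 64))
```

The point of the sketch is only that one set of discrete tokens can carry both signals; UniTok realizes this with a far more capable encoder, decoder, and training recipe.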

Quick Start & Requirements

  • Installation: Clone the repository and install requirements via pip install -r requirements.txt.
  • Prerequisites: Python ≥ 3.10, PyTorch ≥ 2.3.1.
  • Inference: Download a checkpoint, then run python inference.py --ckpt_path <path_to_tokenizer.pth> --src_img <path_to_image> --rec_img <path_to_reconstructed_image> (see the batch sketch after this list).
  • Training: Requires DataComp-1B dataset, external models for loss calculation, and ImageNet validation set. Configuration for distributed training is available in launch.sh.
  • Demo: A Gradio demo is available on Hugging Face.
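To reconstruct a folder of images, the documented inference.py command can be wrapped in a short script. The sketch below uses only the flags shown above; the checkpoint and directory names are placeholders to adapt to your setup.

```python
# Batch reconstruction by wrapping the documented inference.py CLI.
import subprocess
from pathlib import Path

CKPT = "unitok_tokenizer.pth"     # downloaded tokenizer checkpoint (placeholder name)
SRC_DIR = Path("images")          # folder of source images (placeholder)
OUT_DIR = Path("reconstructions") # where reconstructed images are written
OUT_DIR.mkdir(exist_ok=True)

for src in sorted(SRC_DIR.glob("*.jpg")):
    rec = OUT_DIR / src.name
    subprocess.run(
        ["python", "inference.py",
         "--ckpt_path", CKPT,
         "--src_img", str(src),
         "--rec_img", str(rec)],
        check=True,  # stop on the first failed reconstruction
    )
```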

Highlighted Details

  • Achieves rFID of 0.38 and 78.6% accuracy on ImageNet zero-shot classification.
  • Outperforms VILA-U on VQA, GQA, TextVQA, POPE, MME, and MM-Vet benchmarks when used with Llama-2-7B.
  • Demonstrates strong generation performance, with scores of 0.76 for Count, 0.79 for Differ, and 0.74 for Logical on GenAI-Bench.
  • Compatible with frameworks like LlamaGen, LLaVA, Chameleon, and Liquid.

Maintenance & Community

The project is led by Chuofan Ma and Xiaojuan Qi, with contributions from researchers at HKU and ByteDance. Model weights and a Gradio demo are available on Hugging Face.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Training requires significant data preparation for DataComp-1B and specific configurations for distributed training. The README also notes an initialization trade-off: random initialization improves downstream understanding performance, while CLIP weight initialization boosts zero-shot classification accuracy, so the better choice depends on the use case.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 101 stars in the last 90 days
