Unified tokenizer for visual generation and understanding research
UniTok provides a unified visual tokenizer designed for both image generation and understanding tasks, targeting researchers and developers building multimodal large language models (MLLMs). It enables seamless integration with autoregressive generative models and multimodal understanding models, offering a single tokenization solution for diverse MLLM architectures.
How It Works
UniTok trains a single tokenizer for both generation and understanding by combining a reconstruction objective with a contrastive (CLIP-style) objective. Its key technique, multi-codebook quantization, splits each visual token's latent vector into chunks and quantizes each chunk with an independent sub-codebook, greatly expanding the effective vocabulary. Benchmarks show UniTok achieving an rFID of 0.38, far lower (better) than VILA-U (1.80) and VAR (0.90), indicating superior image reconstruction quality. On understanding tasks it is also strong, outperforming VILA-U on several benchmarks when integrated into MLLMs.
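To make the core idea concrete, below is a minimal sketch of multi-codebook quantization; the class, parameter names, and default sizes are illustrative, not the repo's actual API.
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Sketch: split each token's latent into chunks and quantize each
    chunk against its own independent sub-codebook."""
    def __init__(self, dim=256, num_codebooks=8, codebook_size=4096):
        super().__init__()
        assert dim % num_codebooks == 0
        self.chunk_dim = dim // num_codebooks
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.chunk_dim)
            for _ in range(num_codebooks)
        )

    def forward(self, z):  # z: (batch, num_tokens, dim)
        quantized, indices = [], []
        for chunk, book in zip(z.chunk(len(self.codebooks), dim=-1), self.codebooks):
            flat = chunk.reshape(-1, self.chunk_dim)
            # Nearest sub-codebook entry for each chunk (Euclidean distance).
            idx = torch.cdist(flat, book.weight).argmin(dim=-1)
            q = book(idx).view_as(chunk)
            # Straight-through estimator: copy gradients past the discrete lookup.
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx.view(chunk.shape[:-1]))
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
Each token then effectively draws from codebook_size ** num_codebooks combinations while every individual lookup stays small, which is what lets the vocabulary grow without a single unwieldy codebook.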
Quick Start & Requirements
pip install -r requirements.txt
Reconstruct an image with a pretrained tokenizer:
python inference.py --ckpt_path <path_to_tokenizer.pth> --src_img <path_to_image> --rec_img <path_to_reconstructed_image>
Launch training with:
bash launch.sh
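For programmatic use, the round trip that inference.py performs looks roughly like the sketch below; the tokenizer object's encode/decode methods and the [-1, 1] input range are assumptions, so check the repo for the actual API.
import torch
from PIL import Image
from torchvision import transforms

def reconstruct(tokenizer, src_img, rec_img, size=256):
    # Preprocess: resize and normalize to [-1, 1] (assumed input range).
    preprocess = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
    ])
    x = preprocess(Image.open(src_img).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        codes = tokenizer.encode(x)   # image -> discrete token ids (assumed API)
        y = tokenizer.decode(codes)   # token ids -> reconstructed image (assumed API)
    y = (y.squeeze(0).clamp(-1, 1) + 1) / 2
    transforms.ToPILImage()(y.cpu()).save(rec_img)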
Maintenance & Community
The project is led by Chuofan Ma and Xiaojuan Qi, with contributions from researchers at HKU and ByteDance. Model weights and a Gradio demo are available on Hugging Face.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Training requires significant data preparation for DataComp-1B and specific configurations for distributed training. The README also notes an initialization trade-off: random initialization improves downstream understanding performance, while CLIP weight initialization boosts zero-shot classification accuracy, so the better choice depends on the use case.