Visual projector for multimodal LLMs (IJCV2025 research paper)
TokenPacker is a visual projector designed to significantly compress visual tokens for multimodal Large Language Models (LLMs), enabling more efficient high-resolution image understanding. It targets researchers and developers working with vision-language models who need to process larger images or improve inference speed. The core benefit is achieving comparable or better performance with drastically reduced token counts (75%-89% compression).
How It Works
TokenPacker employs a coarse-to-fine scheme to generate condensed visual tokens: it first downsamples the visual features into a small set of low-resolution point queries, then injects the fine-grained, high-resolution features into those queries to enrich them. The result is a compact token set that retains fine visual detail while substantially reducing the computational burden of processing high-resolution images in multimodal LLMs.
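The sketch below illustrates this coarse-to-fine packing under stated assumptions: a 24x24 CLIP-style patch grid, a scale_factor of 2 (each 2x2 region of fine tokens is packed into one coarse token), and a single cross-attention layer standing in for the paper's injection module. The class name, dimensions, and layer choices are illustrative, not the repository's actual implementation.

```python
# Minimal sketch of coarse-to-fine token packing (not the official code).
# Assumes 576 CLIP patch features (24x24 grid) and a scale_factor of 2,
# i.e. each 2x2 region of fine tokens is packed into one coarse query token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFinePacker(nn.Module):
    def __init__(self, dim=1024, scale_factor=2, num_heads=8):
        super().__init__()
        self.s = scale_factor
        # Each coarse query attends only to the fine tokens of its own region.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                        # feats: (B, H, W, C)
        B, H, W, C = feats.shape
        s, h, w = self.s, H // self.s, W // self.s
        # Coarse path: downsample to low resolution -> one point query per region.
        coarse = F.interpolate(feats.permute(0, 3, 1, 2),
                               size=(h, w), mode='bilinear',
                               align_corners=False)   # (B, C, h, w)
        queries = coarse.permute(0, 2, 3, 1).reshape(B * h * w, 1, C)
        # Fine path: group the s*s high-resolution tokens of each region.
        regions = (feats.reshape(B, h, s, w, s, C)
                        .permute(0, 1, 3, 2, 4, 5)
                        .reshape(B * h * w, s * s, C))
        # Injection: enrich each coarse query with its region's fine detail.
        packed, _ = self.attn(queries, regions, regions)
        return packed.reshape(B, h * w, C)            # (B, H*W / s^2, C)

x = torch.randn(2, 24, 24, 1024)                      # e.g. a CLIP-L/336 patch grid
print(CoarseToFinePacker()(x).shape)                  # torch.Size([2, 144, 1024])
```

Note that each coarse query attends only to the fine tokens of its own region, which keeps the injection step cheap compared to global attention while still exposing every high-resolution feature to some query.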
Quick Start & Requirements
Create a conda environment (conda create -n tokenpacker python=3.10), activate it (conda activate tokenpacker), and install the package (pip install -e .). Additional training packages can be installed with pip install -e ".[train]". Flash attention is recommended for training (pip install flash-attn --no-build-isolation).
Highlighted Details
Supports configurable compression via scale factors (scale_factor of [2,3,4]) and patch divisions (patch_num of [9,16,25]).
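As a rough illustration of what these settings mean for token budgets, the arithmetic below assumes the 576-token (24x24) patch grid of the CLIP-L/336 encoder used by the LLaVA-v1.5 codebase this project builds on; under that assumption, the 75%-89% range quoted above corresponds to scale_factor values of 2 and 3.

```python
# Hypothetical token-budget arithmetic; assumes a 576-token (24x24) CLIP grid.
base_tokens = 24 * 24                     # visual tokens before packing
for scale_factor in [2, 3, 4]:
    packed = base_tokens // scale_factor**2
    saved = 1 - packed / base_tokens
    print(f"scale_factor={scale_factor}: {packed} tokens ({saved:.0%} compression)")
# scale_factor=2 -> 144 tokens (75%), 3 -> 64 (89%), 4 -> 36 (94%)
```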
Maintenance & Community
The project accompanies an IJCV2025 paper and has seen recent updates (October 2024), including integration with Osprey. It builds upon the LLaVA-v1.5 codebase and uses data organized by Mini-Gemini.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README does not detail specific limitations or known bugs. The project appears to be research-oriented, with ongoing development indicated by the TODO list.