Visual projector for multimodal LLMs (IJCV2025 research paper)
TokenPacker is a visual projector designed to significantly compress visual tokens for multimodal Large Language Models (LLMs), enabling more efficient high-resolution image understanding. It targets researchers and developers working with vision-language models who need to process larger images or improve inference speed. The core benefit is achieving comparable or better performance with drastically reduced token counts (75%-89% compression).
How It Works
TokenPacker employs a coarse-to-fine scheme to generate condensed visual tokens: it first downsamples the visual features into a small set of low-resolution point queries, then injects the fine-grained, high-resolution features into those queries to enrich them. The result is a compact token set that retains fine visual detail while substantially reducing the computational burden of processing high-resolution images in multimodal LLMs.
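The sketch below illustrates this coarse-to-fine packing under stated assumptions: a 24x24 CLIP-style patch grid, a scale_factor of 2 (each 2x2 region of fine tokens is packed into one coarse token), and a single cross-attention layer standing in for the paper's injection module. The class name, dimensions, and layer choices are illustrative, not the repository's actual implementation.

```python
# Minimal sketch of coarse-to-fine token packing (not the official code).
# Assumes 576 CLIP patch features (24x24 grid) and a scale_factor of 2,
# i.e. each 2x2 region of fine tokens is packed into one coarse query token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFinePacker(nn.Module):
    def __init__(self, dim=1024, scale_factor=2, num_heads=8):
        super().__init__()
        self.s = scale_factor
        # Each coarse query attends only to the fine tokens of its own region.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                        # feats: (B, H, W, C)
        B, H, W, C = feats.shape
        s, h, w = self.s, H // self.s, W // self.s
        # Coarse path: downsample to low resolution -> one point query per region.
        coarse = F.interpolate(feats.permute(0, 3, 1, 2),
                               size=(h, w), mode='bilinear',
                               align_corners=False)   # (B, C, h, w)
        queries = coarse.permute(0, 2, 3, 1).reshape(B * h * w, 1, C)
        # Fine path: group the s*s high-resolution tokens of each region.
        regions = (feats.reshape(B, h, s, w, s, C)
                        .permute(0, 1, 3, 2, 4, 5)
                        .reshape(B * h * w, s * s, C))
        # Injection: enrich each coarse query with its region's fine detail.
        packed, _ = self.attn(queries, regions, regions)
        return packed.reshape(B, h * w, C)            # (B, H*W / s^2, C)

x = torch.randn(2, 24, 24, 1024)                      # e.g. a CLIP-L/336 patch grid
print(CoarseToFinePacker()(x).shape)                  # torch.Size([2, 144, 1024])
```

Note that each coarse query attends only to the fine tokens of its own region, which keeps the injection step cheap compared to global attention while still exposing every high-resolution feature to some query.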
Quick Start & Requirements
Create a conda environment (conda create -n tokenpacker python=3.10), activate it (conda activate tokenpacker), and install the package (pip install -e .). Additional training packages can be installed with pip install -e ".[train]". Flash attention is recommended for training (pip install flash-attn --no-build-isolation).
Highlighted Details
Supports configurable compression via scale factors (scale_factor of [2,3,4]) and patch divisions (patch_num of [9,16,25]).
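As a rough illustration of what these settings mean for token budgets, the arithmetic below assumes the 576-token (24x24) patch grid of the CLIP-L/336 encoder used by the LLaVA-v1.5 codebase this project builds on; under that assumption, the 75%-89% range quoted above corresponds to scale_factor values of 2 and 3.

```python
# Hypothetical token-budget arithmetic; assumes a 576-token (24x24) CLIP grid.
base_tokens = 24 * 24                     # visual tokens before packing
for scale_factor in [2, 3, 4]:
    packed = base_tokens // scale_factor**2
    saved = 1 - packed / base_tokens
    print(f"scale_factor={scale_factor}: {packed} tokens ({saved:.0%} compression)")
# scale_factor=2 -> 144 tokens (75%), 3 -> 64 (89%), 4 -> 36 (94%)
```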
Maintenance & Community
The project accompanies an IJCV2025 paper and has seen recent updates (October 2024), including integration with Osprey. It builds upon the LLaVA-v1.5 codebase and uses data organized by Mini-Gemini.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README does not detail specific limitations or known bugs. The project appears to be research-oriented, with ongoing development indicated by the TODO list.