TokenPacker by CircleRadon

Visual projector for multimodal LLMs (IJCV2025 research paper)

created 1 year ago
262 stars

Top 97.8% on sourcepulse

View on GitHub
Project Summary

TokenPacker is a visual projector designed to significantly compress visual tokens for multimodal Large Language Models (LLMs), enabling more efficient high-resolution image understanding. It targets researchers and developers working with vision-language models who need to process larger images or improve inference speed. The core benefit is achieving comparable or better performance with drastically reduced token counts (75%-89% compression).

How It Works

TokenPacker employs a coarse-to-fine scheme to generate condensed visual tokens: it first downsamples the visual features into a small set of low-resolution point queries, then injects the fine-grained, high-resolution features into those queries to refine them. The result is a compact yet information-rich representation that sharply reduces the computational burden of processing high-resolution images in multimodal LLMs.
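
A minimal sketch of the idea in PyTorch. The module layout, the use of plain global cross-attention (the paper's injection step works region-to-point rather than globally), and all dimensions are illustrative assumptions, not TokenPacker's actual implementation:

```python
# Minimal coarse-to-fine projector sketch. Names, shapes, and the global
# cross-attention are assumptions; TokenPacker's region-to-point injection
# restricts each coarse query to its local high-resolution region.
import torch
import torch.nn as nn

class CoarseToFineProjector(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, scale_factor: int = 2):
        super().__init__()
        # Coarse "point queries" come from spatially downsampled features.
        self.pool = nn.AvgPool2d(kernel_size=scale_factor)
        # Cross-attention injects fine-grained detail into the coarse queries.
        self.inject = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H, W, C) high-resolution ViT features
        B, H, W, C = feats.shape
        grid = feats.permute(0, 3, 1, 2)                     # (B, C, H, W)
        coarse = self.pool(grid).flatten(2).transpose(1, 2)  # (B, N/s^2, C) queries
        fine = feats.reshape(B, H * W, C)                    # (B, N, C) keys/values
        out, _ = self.inject(coarse, fine, fine)             # enrich coarse queries
        return self.proj(out)                                # condensed visual tokens

# Example: a 24x24 ViT grid (576 tokens) becomes 144 tokens at scale_factor=2.
tokens = CoarseToFineProjector()(torch.randn(1, 24, 24, 1024))
print(tokens.shape)  # torch.Size([1, 144, 4096])
```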

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n tokenpacker python=3.10), activate it (conda activate tokenpacker), and install the package (pip install -e .). Additional training packages can be installed with pip install -e ".[train]". Flash attention is recommended for training (pip install flash-attn --no-build-isolation).
  • Prerequisites: Python 3.10 and a CUDA toolchain (flash-attn requires CUDA to build).
  • Resources: Training requires substantial datasets (LLaVA-Pretrain-558K, Mix665k, or Mini-Gemini variants). Pre-trained checkpoints are available; a loading sketch follows this list.
  • Links: GitHub Repo, Paper
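
Since the project builds on the LLaVA-v1.5 codebase, loading a checkpoint presumably follows LLaVA's conventions. A hypothetical sketch, assuming the fork keeps upstream LLaVA's load_pretrained_model helper and using a placeholder checkpoint path (neither verified against this repo):

```python
# Hypothetical loading sketch: assumes TokenPacker preserves LLaVA-v1.5's
# builder API; the model path below is a placeholder, not a verified name.
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="path/to/tokenpacker-7b",  # placeholder checkpoint location
    model_base=None,                      # full (non-LoRA) checkpoint assumed
    model_name="tokenpacker-7b",
)
print(f"context length: {context_len}")
```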

Highlighted Details

  • Achieves 75%-89% compression of visual tokens.
  • Offers a TokenPacker-HD framework for fine-grained, high-resolution pixel-level understanding.
  • Supports multiple compression ratios (scale_factor of 2, 3, or 4) and patch divisions (patch_num of 9, 16, or 25); see the compression arithmetic after this list.
  • Provides pre-trained checkpoints for TokenPacker-7b/13b and TokenPacker-HD-7b/13b models.
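
For intuition on how scale_factor relates to the 75%-89% figure above: assuming tokens are merged over s×s spatial neighborhoods, the token count drops by a factor of s², as this short computation shows.

```python
# Compression implied by each supported scale_factor, assuming s x s merging.
for s in (2, 3, 4):
    print(f"scale_factor={s}: {1 - 1 / s**2:.1%} fewer visual tokens")
# scale_factor=2: 75.0% | scale_factor=3: 88.9% | scale_factor=4: 93.8%
```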

Maintenance & Community

The project accompanies an IJCV2025 paper and saw updates as recently as October 2024, including an integration with Osprey. It builds on the LLaVA-v1.5 codebase and uses data organized by Mini-Gemini.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations or known bugs. The project appears to be research-oriented, with ongoing development indicated by the TODO list.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

18 stars in the last 90 days

Explore Similar Projects

Starred by Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley), Nathan Lambert (AI researcher at AI2), and 1 more.

unified-io-2 by allenai

0.3% · 619 stars
Unified-IO 2 code for training, inference, and demo
created 1 year ago · updated 1 year ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

open-r1 by huggingface

0.2% · 25k stars
SDK for reproducing DeepSeek-R1
created 6 months ago · updated 3 days ago