TokenPacker by CircleRadon

Visual projector for multimodal LLMs (IJCV 2025 research paper)

Created 1 year ago · 267 stars · Top 95.9% on SourcePulse

Project Summary

TokenPacker is a visual projector that substantially compresses the visual tokens fed to multimodal large language models (LLMs), enabling more efficient high-resolution image understanding. It targets researchers and developers working with vision-language models who need to process larger images or speed up inference. The core benefit is comparable or better performance with drastically fewer visual tokens (75%-89% compression).

How It Works

TokenPacker uses a coarse-to-fine scheme to generate condensed visual tokens: visual features are first downsampled into coarse, low-resolution queries, and each query is then enriched with the fine-grained, high-resolution features in its local region. This yields a compact token set that preserves detail while sharply reducing the computational burden of processing high-resolution images in multimodal LLMs.
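A minimal PyTorch sketch of this idea follows: average-pooled features serve as coarse queries, and each query cross-attends to the fine-grained tokens in its own region. The module name, shapes, and layer choices here are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class CoarseToFinePacker(nn.Module):
    """Illustrative coarse-to-fine projector (not TokenPacker's real code).

    Downsampled features act as coarse queries; each query is refined by
    cross-attending to the s*s fine-grained tokens in its region, so an
    H x W token grid is packed into (H/s) x (W/s) tokens.
    """

    def __init__(self, dim=1024, scale_factor=2, num_heads=8):
        super().__init__()
        self.s = scale_factor
        self.pool = nn.AvgPool2d(scale_factor)               # coarse queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats):                                # (B, H, W, C)
        B, H, W, C = feats.shape
        s = self.s
        coarse = self.pool(feats.permute(0, 3, 1, 2))        # (B, C, H/s, W/s)
        queries = coarse.permute(0, 2, 3, 1).reshape(-1, 1, C)
        # Group the fine tokens into the s*s region under each coarse query.
        fine = feats.view(B, H // s, s, W // s, s, C)
        fine = fine.permute(0, 1, 3, 2, 4, 5).reshape(-1, s * s, C)
        packed, _ = self.attn(queries, fine, fine)           # one query per region
        return self.proj(packed.view(B, -1, C))              # (B, H*W/s^2, C)

# 576 tokens (24x24 grid) -> 144 tokens at scale_factor=2, i.e. 75% compression.
packer = CoarseToFinePacker(dim=1024, scale_factor=2)
out = packer(torch.randn(1, 24, 24, 1024))
print(out.shape)  # torch.Size([1, 144, 1024])
```

Deriving the queries by pooling keeps the scheme parameter-light; the cross-attention step is where the fine-grained detail gets injected back into the condensed tokens.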

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n tokenpacker python=3.10), activate it (conda activate tokenpacker), and install the package (pip install -e .). Additional training packages can be installed with pip install -e ".[train]". Flash attention is recommended for training (pip install flash-attn --no-build-isolation).
  • Prerequisites: Python 3.10; a CUDA-capable GPU (implied by the flash-attn dependency).
  • Resources: Training requires substantial datasets (LLaVA-Pretrain-558K, Mix665k, or Mini-Gemini variants). Pre-trained checkpoints are available; see the loading sketch after this list.
  • Links: GitHub Repo, Paper
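Since TokenPacker builds on the LLaVA-v1.5 codebase, loading a released checkpoint presumably follows LLaVA's builder API. The sketch below assumes exactly that; the path and model name are placeholders, not documented identifiers, so check the repository's own scripts before relying on it.

```python
# Assumed LLaVA-style checkpoint loading; names below are placeholders.
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="checkpoints/TokenPacker-7b",  # placeholder local path
    model_base=None,
    model_name="tokenpacker-7b",              # hypothetical; LLaVA dispatches on this name
)
```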

Highlighted Details

  • Achieves 75%-89% compression of visual tokens.
  • Offers a TokenPacker-HD framework for fine-grained, high-resolution pixel-level understanding.
  • Supports multiple compression ratios (scale_factor of 2, 3, or 4) and patch divisions (patch_num of 9, 16, or 25); see the token-budget sketch after this list.
  • Provides pre-trained checkpoints for TokenPacker-7b/13b and TokenPacker-HD-7b/13b models.
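As a quick sanity check on those numbers, the snippet below works out the token budget at each documented scale_factor, assuming the 24×24 = 576-token feature grid of LLaVA-v1.5 (the base count is an assumption for illustration). The scale_factor 2 and 3 settings line up with the 75%-89% range quoted above.

```python
# Token-budget arithmetic for the documented scale_factor settings.
# Assumes a 24x24 = 576-token feature grid (as in LLaVA-v1.5).
BASE_TOKENS = 24 * 24
for s in (2, 3, 4):
    kept = BASE_TOKENS // (s * s)
    print(f"scale_factor={s}: {kept} tokens ({1 - kept / BASE_TOKENS:.0%} compression)")
# scale_factor=2: 144 tokens (75% compression)
# scale_factor=3: 64 tokens (89% compression)
# scale_factor=4: 36 tokens (94% compression)
```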

Maintenance & Community

The work was accepted to IJCV 2025, and recent updates (October 2024) integrate TokenPacker into Osprey. The project builds on the LLaVA-v1.5 codebase and uses data organized by Mini-Gemini.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations or known bugs. The project appears to be research-oriented, with ongoing development indicated by the TODO list.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; former cofounder of Luma AI) and Phil Wang (prolific research paper implementer).

Cosmos-Tokenizer by NVIDIA

Top 0.1% on SourcePulse · 2k stars
Suite of neural tokenizers for image and video processing
Created 10 months ago · Updated 7 months ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Zhiqiang Xie (coauthor of SGLang), and 1 more.

Sana by NVlabs

Top 0.4% on SourcePulse · 4k stars
Image synthesis research paper using a linear diffusion transformer
Created 11 months ago · Updated 5 days ago