Awesome-Multimodal-Token-Compression by cokeshao

A comprehensive survey of multimodal token compression techniques

Created 6 months ago
270 stars

Top 95.4% on SourcePulse

View on GitHub
Project Summary

This repository serves as a comprehensive survey of multimodal token compression techniques, addressing the critical challenge of the excessive token counts that arise when Multimodal Large Language Models (MLLMs) process large image, video, and audio inputs. It targets researchers and engineers seeking to improve the efficiency and scalability of MLLMs for real-world applications, where input sequence lengths often exceed the context capacity of current models. The primary benefit is a curated, organized overview of state-of-the-art methods, accelerating the understanding and adoption of efficient multimodal AI.
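
To make the core idea concrete, the sketch below shows one generic flavor of token compression (average-pooling merges of adjacent visual tokens). It is an illustrative example only, not code from the survey or from any specific paper it covers.

```python
import torch

def merge_tokens(tokens: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    """Average-pool groups of `ratio` adjacent tokens into one.

    tokens: (batch, seq_len, dim), e.g. ViT patch embeddings fed to an MLLM.
    Returns a tensor of shape (batch, seq_len // ratio, dim).
    """
    b, n, d = tokens.shape
    n_keep = (n // ratio) * ratio  # drop any remainder for simplicity
    grouped = tokens[:, :n_keep].reshape(b, n_keep // ratio, ratio, d)
    return grouped.mean(dim=2)     # one merged token per group

# Example: 576 patch tokens per frame -> 144 tokens after 4x compression.
frame_tokens = torch.randn(1, 576, 1024)
print(merge_tokens(frame_tokens).shape)  # torch.Size([1, 144, 1024])
```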

How It Works

This project functions as a curated collection and structured survey of academic papers focused on multimodal token compression. It categorizes research by modality (Image LLM, Video LLM, Audio LLM) and underlying architectural components (e.g., Vision Transformer, Audio Transformer). The repository provides direct links to papers, associated GitHub repositories, and Hugging Face models, enabling users to quickly access and evaluate relevant work. A key feature is a Notion database for efficient searching and filtering of the surveyed literature.
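
As a rough illustration of that organization, the snippet below sketches how a surveyed entry might be represented and filtered by modality. The field names and values are hypothetical assumptions for illustration, not the repository's or the Notion database's actual schema.

```python
# Hypothetical entry layout mirroring the survey's categorization by modality
# and architecture; field names are assumptions, not the repo's actual schema.
papers = [
    {
        "title": "Example Token-Merging Paper",
        "modality": "Video LLM",
        "architecture": "Vision Transformer",
        "links": {"paper": None, "code": None, "model": None},  # placeholders
    },
    {
        "title": "Example Audio Compression Paper",
        "modality": "Audio LLM",
        "architecture": "Audio Transformer",
        "links": {"paper": None, "code": None, "model": None},
    },
]

# Quick filter by modality, similar in spirit to the Notion database's filtering.
video_papers = [p for p in papers if p["modality"] == "Video LLM"]
print([p["title"] for p in video_papers])
```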

Quick Start & Requirements

This is a survey repository; it requires no installation or software prerequisites. Users can access the survey paper on arXiv (arXiv:2507.20198) and explore the curated database via the provided Notion link.

Highlighted Details

  • Comprehensive survey paper detailing multimodal long-context token compression across image, video, and audio inputs.
  • An interactive Notion database for quick search and filtering of the surveyed papers.
  • Categorization of techniques by modality (Image, Video, Audio) and architecture.
  • Regular updates, including recent papers accepted at major conferences such as NeurIPS'25.

Maintenance & Community

The repository shows recent activity, with updates noted in October, August, and July of 2025, indicating active maintenance. Contact information for the authors is provided for suggestions, clarifications, or collaboration opportunities.

Licensing & Compatibility

The project is licensed under the MIT License, which permits broad use, modification, and distribution, including for commercial purposes, with minimal restrictions beyond attribution.

Limitations & Caveats

As a survey, this repository is a curated snapshot of existing research and may not include every emerging technique immediately. It serves as a guide to external resources rather than providing direct implementation code or tools.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 36 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Wing Lian (Founder of Axolotl AI).

AnyGPT by OpenMOSS
Top 0.1% on SourcePulse · 863 stars
Multimodal LLM research paper for any-to-any modality conversion
Created 1 year ago · Updated 1 year ago

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA
Top 0.1% on SourcePulse · 2k stars
Suite of neural tokenizers for image and video processing
Created 1 year ago · Updated 11 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT
Top 0.1% on SourcePulse · 4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago · Updated 8 months ago

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel
Top 0.1% on SourcePulse · 7k stars
Multimodal autoregressive model for long-context video/text
Created 1 year ago · Updated 1 year ago