VisionZip by dvlab-research

Visual-token reduction research paper for efficient vision-language models (VLMs)

Created 9 months ago
347 stars

Top 80.0% on SourcePulse

Project Summary

VisionZip addresses the computational inefficiency of Vision Language Models (VLMs) by drastically reducing the number of visual tokens processed without significant performance loss. Targeting researchers and developers working with VLMs, it offers substantial speedups and memory savings during inference and training.

How It Works

VisionZip employs a text-agnostic method to select a small subset of dominant and contextual visual tokens from the input sequence. This approach aims to retain the most salient information while discarding redundant or less informative tokens, leading to faster processing and reduced memory footprint. Its text-agnostic nature allows it to be integrated with any VLM architecture and existing LLM acceleration techniques.
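To make the mechanism concrete, here is a minimal sketch of this two-stage selection, assuming a ViT-style encoder whose [CLS] attention scores are available. The function name select_visual_tokens and the parameters num_dominant and num_contextual are illustrative, not the library's actual API, and the pooling of leftover tokens into contextual tokens is a simplified stand-in for the paper's merging step.

    # Illustrative sketch only -- not VisionZip's actual API.
    import torch
    import torch.nn.functional as F

    def select_visual_tokens(features, cls_attn, num_dominant=54, num_contextual=10):
        """Reduce N visual tokens to num_dominant + num_contextual tokens.

        features: (N, D) visual token embeddings from the vision encoder
        cls_attn: (N,)   attention each token receives from the [CLS] token
        """
        # Dominant tokens: keep the tokens the encoder itself attends to most.
        dominant_idx = cls_attn.topk(num_dominant).indices
        dominant = features[dominant_idx]

        # Contextual tokens: merge the remaining tokens into a few summaries
        # instead of discarding them outright.
        mask = torch.ones(features.size(0), dtype=torch.bool)
        mask[dominant_idx] = False
        remaining = features[mask]  # (N - num_dominant, D)

        # Simplified similarity-based pooling: assign each remaining token to
        # its most similar anchor, then average each group.
        anchor_idx = torch.linspace(0, remaining.size(0) - 1, num_contextual).long()
        anchors = remaining[anchor_idx]  # (num_contextual, D)
        sim = F.normalize(remaining, dim=-1) @ F.normalize(anchors, dim=-1).T
        assign = sim.argmax(dim=-1)  # (N - num_dominant,)
        contextual = torch.stack([
            remaining[assign == k].mean(dim=0) if (assign == k).any() else anchors[k]
            for k in range(num_contextual)
        ])
        # e.g. 576 tokens -> 64 tokens (~10%) with the default settings
        return torch.cat([dominant, contextual], dim=0)

With the default settings above, the 576 visual tokens of a LLaVA-1.5 image would be reduced to 64 tokens, consistent with the ~10% retention figure cited under Highlighted Details.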

Quick Start & Requirements

  • Install via pip: pip install visionzip
  • For development: clone the repo and run pip install -e .
  • Requires a working LLaVA environment (see the usage sketch below).
  • Official demo: Hugging Face Space
  • Usage video: linked from the repository README
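For orientation, a minimal usage sketch in a LLaVA environment follows. The visionzip entry point with dominant and contextual keyword arguments reflects the repository's README as best understood; treat the exact signature and the example checkpoint name as assumptions to verify against the repo.

    # Hedged usage sketch: check the repo README for the current API.
    from llava.mm_utils import get_model_name_from_path
    from llava.model.builder import load_pretrained_model
    from visionzip import visionzip

    model_path = "liuhaotian/llava-v1.5-7b"  # example checkpoint (assumed)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path),
    )

    # Keep 54 dominant + 10 contextual visual tokens (~10% of LLaVA's 576).
    model = visionzip(model, dominant=54, contextual=10)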

Highlighted Details

  • Achieves state-of-the-art performance among efficient VLM methods.
  • Retains only ~10% of visual tokens while preserving ~95% of baseline performance in training-free mode.
  • Applicable during inference, efficient tuning, and training stages, saving memory and time.
  • Significantly reduces prefilling time, and total inference time when a KV cache is used.

Maintenance & Community

  • Accepted at CVPR 2025.
  • Built upon LLaVA, mini-Gemini, Lmms-Eval, and Video-LLaVA.
  • Demo-Chat code available in a 'demo-chat' branch for interactive analysis.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is a recent research artifact (accepted at CVPR 2025). While the authors claim minimal performance degradation, the exact impact on specific downstream tasks and edge cases is not detailed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 10 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

Explore Similar Projects

LookaheadDecoding by hao-ai-lab

  • Top 0.2% on SourcePulse, 1k stars
  • Parallel decoding algorithm for faster LLM inference
  • Created 1 year ago, updated 6 months ago
  • Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

  • Top 10.6% on SourcePulse, 2k stars
  • Speculative decoding research paper for faster LLM inference
  • Created 1 year ago, updated 1 week ago