Inference acceleration for large vision-language models (research paper)
FastV offers a plug-and-play inference acceleration method for large vision-language models (LVLMs) by pruning redundant visual tokens in deep layers. It targets researchers and engineers working with LVLMs, providing significant theoretical FLOPs reduction (up to 45%) without performance degradation, enabling faster and more efficient model deployment.
How It Works
FastV ranks visual tokens by the attention they receive at a chosen early layer and discards the lowest-ranked tokens in all deeper layers. This leverages the observation that visual information becomes less critical, or increasingly redundant, as the model processes deeper layers. By selectively pruning these tokens, FastV reduces the computational load, particularly in the self-attention mechanism, leading to faster inference and lower memory consumption.
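To make the pruning step concrete, here is a minimal sketch of attention-score-based token dropping for a single sequence. The function name, tensor shapes, and the `keep_ratio` parameter are illustrative assumptions, not the repository's actual interface.

```python
import torch

def prune_visual_tokens(hidden_states, attn_weights, image_token_mask, keep_ratio=0.5):
    """
    Drop the least-attended image tokens after a chosen layer K (single sequence, no batch dim).

    hidden_states:    (seq_len, hidden_dim) activations entering layer K + 1
    attn_weights:     (num_heads, seq_len, seq_len) attention weights from layer K
    image_token_mask: (seq_len,) bool, True at image-token positions
    keep_ratio:       fraction of image tokens to retain (illustrative default)
    """
    # Average attention each position *receives*, over heads and query positions.
    received = attn_weights.mean(dim=0).mean(dim=0)          # (seq_len,)

    image_idx = image_token_mask.nonzero(as_tuple=True)[0]   # image-token positions
    num_keep = max(1, int(keep_ratio * image_idx.numel()))

    # Keep only the image tokens with the highest received-attention scores.
    top = received[image_idx].topk(num_keep).indices
    kept_image_idx = image_idx[top]

    # Text and system tokens are never pruned; only image tokens are dropped.
    keep_mask = ~image_token_mask
    keep_mask[kept_image_idx] = True

    kept_positions = keep_mask.nonzero(as_tuple=True)[0]
    return hidden_states[kept_positions], kept_positions
```

All layers after the pruning point then operate only on the returned positions, which is where the FLOPs reduction comes from.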
Quick Start & Requirements
Use conda to create an environment, then run `bash setup.sh` from the `src` directory to install dependencies. Launch the demo with `python demo.py --model-path ./llava-v1.5-7b`.
Highlighted Details
Maintenance & Community
The project was accepted to ECCV 2024 as an oral presentation. Key contributions from Zhihang Lin are acknowledged. Further details and discussions can be found in the project's GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license. The code is presented for research purposes, and commercial use would require clarification of licensing terms.
Limitations & Caveats
The KV-cache implementation diverges slightly from the original FastV: visual tokens are pruned once and the reduced cache is held fixed across all subsequent decoding steps, rather than being re-selected at every step. Latency gains from the KV-cache variant are currently modest for single-image tasks because token sequences are short, but the approach shows promise for video processing, where visual token counts are much larger.
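As an illustration of the fixed-pruning behavior described above, the sketch below slices a cached key/value list once after prefill and reuses it unchanged for every later decoding step. The cache layout, the `prune_kv_cache` name, and the `start_layer` parameter are assumptions for illustration, not the repository's actual API.

```python
import torch

def prune_kv_cache(past_key_values, kept_positions, start_layer):
    """
    Slice cached keys/values once after prefill; the reduced cache is then
    reused unchanged for all subsequent decoding steps.

    past_key_values: list of (key, value) pairs, each of shape
                     (batch, num_heads, seq_len, head_dim)
    kept_positions:  1-D LongTensor of token positions to keep
    start_layer:     layers >= start_layer use the pruned cache
    """
    pruned = []
    for layer_idx, (key, value) in enumerate(past_key_values):
        if layer_idx < start_layer:
            # Shallow layers keep the full cache.
            pruned.append((key, value))
        else:
            # Deeper layers keep only the selected positions.
            pruned.append((key[:, :, kept_positions, :],
                           value[:, :, kept_positions, :]))
    return pruned
```

Because the kept set is chosen once from prefill attention, later decoding steps cannot revisit tokens discarded at that point, which is the source of the divergence noted above.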