FastV by pkunlp-icler

Inference acceleration for large vision-language models (research paper)

created 1 year ago
463 stars

Top 66.4% on sourcepulse

View on GitHub
Project Summary

FastV offers a plug-and-play inference acceleration method for large vision-language models (LVLMs) by pruning redundant visual tokens in deep layers. It targets researchers and engineers working with LVLMs, providing significant theoretical FLOPs reduction (up to 45%) without performance degradation, enabling faster and more efficient model deployment.

How It Works

FastV ranks the visual tokens in a chosen deep layer by the attention they receive and discards the least informative ones from that layer onward. This leverages the observation that attention over visual tokens becomes sparse and redundant in the deeper layers of LVLMs. By pruning these tokens, FastV shrinks the sequence length for all subsequent layers, reducing the computational load of the self-attention and feed-forward blocks and yielding faster inference and lower memory consumption.
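For intuition, here is a minimal, hypothetical PyTorch sketch of that pruning step, assuming attention weights are available at the layer where pruning begins; the function and tensor names are invented for illustration and do not come from the repository.

```python
import torch

def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        image_start: int,
                        image_len: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Drop the least-attended visual tokens from one layer's output.

    hidden_states: (seq_len, hidden_dim) activations entering the next layer.
    attn_weights:  (num_heads, seq_len, seq_len) attention from this layer.
    image_start/image_len: where the visual tokens sit in the sequence.
    keep_ratio: fraction of visual tokens to keep (0.5 mirrors a 50% setting).
    """
    # Average attention each position receives, over heads and query positions.
    received = attn_weights.mean(dim=0).mean(dim=0)          # (seq_len,)
    img_scores = received[image_start:image_start + image_len]

    # Keep the top-k most-attended visual tokens, preserving their order.
    k = max(1, int(image_len * keep_ratio))
    top_idx = img_scores.topk(k).indices.sort().values + image_start

    keep = torch.cat([
        torch.arange(0, image_start),                        # text before image
        top_idx,                                             # surviving image tokens
        torch.arange(image_start + image_len, hidden_states.size(0)),
    ])
    return hidden_states[keep]
```

All layers after the pruning point then operate on the shorter sequence, which is where the FLOPs saving comes from.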

Quick Start & Requirements

  • Install: Use conda to create an environment and run bash setup.sh from the src directory.
  • Prerequisites: Python 3.10, PyTorch, Pillow, Accelerate. Running FastV with HuggingFace LLaVA models requires installing the modified transformers package shipped in the repository (a hedged usage sketch follows this list).
  • Demo: An online demo is available at https://www.fastv.work/. A local demo can be run with python demo.py --model-path ./llava-v1.5-7b.
  • Resources: Requires a GPU for inference and demonstration. Specific latency tests were conducted on an A100 GPU.
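Assuming the repository's modified transformers package is installed, enabling FastV on a HuggingFace LLaVA model might look like the sketch below. The config attribute names are assumptions modeled on the method's two hyperparameters (the filtering layer K and the number of retained visual tokens); consult the repository for the actual interface.

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the checkpoint referenced by the demo command above.
model = LlavaForConditionalGeneration.from_pretrained("./llava-v1.5-7b")
processor = AutoProcessor.from_pretrained("./llava-v1.5-7b")

# Hypothetical FastV knobs; these attribute names are assumptions,
# not the repository's documented API.
model.config.use_fast_v = True            # assumption: master switch
model.config.fast_v_agg_layer = 2         # assumption: prune from layer K = 2
model.config.fast_v_attention_rank = 288  # assumption: keep ~50% of 576 tokens
```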

Highlighted Details

  • Achieves up to 45% theoretical FLOPs reduction (a back-of-envelope check follows this list).
  • Demonstrated latency reduction of up to 25% for video understanding tasks with KV cache enabled.
  • Compatible with model quantization (e.g., 4-bit).
  • Supports HuggingFace LLaVA models and integrates with lmms-eval for benchmarking.
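The FLOPs figure can be sanity-checked with a common per-layer transformer FLOPs estimate. The shapes below (LLaVA-1.5-7B-like dimensions, 576 visual tokens, pruning half of them from layer 2 onward) are illustrative assumptions rather than the paper's exact settings.

```python
def layer_flops(n: int, d: int = 4096, m: int = 11008) -> float:
    # QKV/output projections + attention scores + two-layer FFN,
    # per decoder layer, for a sequence of n tokens.
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m

def fastv_flops_reduction(n_text: int = 64, n_img: int = 576,
                          n_layers: int = 32, k: int = 2,
                          keep_ratio: float = 0.5) -> float:
    n_full = n_text + n_img
    n_pruned = n_text + int(n_img * keep_ratio)
    full = n_layers * layer_flops(n_full)
    fast = k * layer_flops(n_full) + (n_layers - k) * layer_flops(n_pruned)
    return 1 - fast / full

print(f"theoretical FLOPs reduction: {fastv_flops_reduction():.1%}")
```

With these illustrative numbers the function returns roughly 43%, in the same ballpark as the reported up-to-45% figure.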

Maintenance & Community

The project is associated with ECCV 2024, where the paper was selected for an oral presentation. The authors acknowledge key contributions from Zhihang Lin. Further details and discussion can be found in the GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license. The code is presented for research purposes; commercial use would require clarifying the licensing terms with the authors.

Limitations & Caveats

The KV cache implementation diverges slightly from the original FastV: the set of pruned tokens is decided once and applied uniformly to all subsequent decoding steps, rather than being recomputed at each step. Latency gains with the KV cache are currently modest for single-image tasks, where token sequences are short, but are more pronounced for video processing.
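To make the distinction concrete, here is a hypothetical sketch of the uniform behavior (all names invented):

```python
import torch

def prune_kv_cache(keys: torch.Tensor, values: torch.Tensor,
                   keep_idx: torch.Tensor):
    """keys/values: (num_heads, seq_len, head_dim); keep_idx: positions to keep."""
    return keys[:, keep_idx, :], values[:, keep_idx, :]

# Prefill: rank visual tokens once (e.g. by received attention) and fix keep_idx.
# Every decode step afterwards appends its new K/V entries to the pruned cache;
# the dropped visual positions are never re-evaluated -- the "uniform" behavior
# noted above, as opposed to re-ranking tokens at each step.
```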

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

  • 53 stars in the last 90 days
