Inference acceleration for large vision-language models (research paper)
FastV offers a plug-and-play inference acceleration method for large vision-language models (LVLMs) by pruning redundant visual tokens in deep layers. It targets researchers and engineers working with LVLMs, providing significant theoretical FLOPs reduction (up to 45%) without performance degradation, enabling faster and more efficient model deployment.
How It Works
FastV ranks visual tokens by the attention they receive at a chosen early layer and discards the lowest-ranked tokens in all deeper layers. This leverages the observation that visual information becomes less critical, or increasingly redundant, as the model processes deeper layers. By selectively pruning these tokens, FastV reduces the computational load, particularly in the self-attention mechanism, leading to faster inference and lower memory consumption.
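To make the pruning step concrete, here is a minimal sketch of attention-score-based token dropping for a single sequence. The function name, tensor shapes, and the `keep_ratio` parameter are illustrative assumptions, not the repository's actual interface.

```python
import torch

def prune_visual_tokens(hidden_states, attn_weights, image_token_mask, keep_ratio=0.5):
    """
    Drop the least-attended image tokens after a chosen layer K (single sequence, no batch dim).

    hidden_states:    (seq_len, hidden_dim) activations entering layer K + 1
    attn_weights:     (num_heads, seq_len, seq_len) attention weights from layer K
    image_token_mask: (seq_len,) bool, True at image-token positions
    keep_ratio:       fraction of image tokens to retain (illustrative default)
    """
    # Average attention each position *receives*, over heads and query positions.
    received = attn_weights.mean(dim=0).mean(dim=0)          # (seq_len,)

    image_idx = image_token_mask.nonzero(as_tuple=True)[0]   # image-token positions
    num_keep = max(1, int(keep_ratio * image_idx.numel()))

    # Keep only the image tokens with the highest received-attention scores.
    top = received[image_idx].topk(num_keep).indices
    kept_image_idx = image_idx[top]

    # Text and system tokens are never pruned; only image tokens are dropped.
    keep_mask = ~image_token_mask
    keep_mask[kept_image_idx] = True

    kept_positions = keep_mask.nonzero(as_tuple=True)[0]
    return hidden_states[kept_positions], kept_positions
```

All layers after the pruning point then operate only on the returned positions, which is where the FLOPs reduction comes from.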
Quick Start & Requirements
Use conda to create an environment, then run `bash setup.sh` from the `src` directory to install dependencies. Launch the demo with `python demo.py --model-path ./llava-v1.5-7b`.
Highlighted Details
Maintenance & Community
The project was accepted to ECCV 2024 as an oral presentation. Key contributions from Zhihang Lin are acknowledged. Further details and discussions can be found in the project's GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license. The code is presented for research purposes, and commercial use would require clarification of licensing terms.
Limitations & Caveats
The KV-cache implementation diverges slightly from the original FastV: visual tokens are pruned once and the reduced cache is held fixed across all subsequent decoding steps, rather than being re-selected at every step. Latency gains from the KV-cache variant are currently modest for single-image tasks because token sequences are short, but the approach shows promise for video processing, where visual token counts are much larger.
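As an illustration of the fixed-pruning behavior described above, the sketch below slices a cached key/value list once after prefill and reuses it unchanged for every later decoding step. The cache layout, the `prune_kv_cache` name, and the `start_layer` parameter are assumptions for illustration, not the repository's actual API.

```python
import torch

def prune_kv_cache(past_key_values, kept_positions, start_layer):
    """
    Slice cached keys/values once after prefill; the reduced cache is then
    reused unchanged for all subsequent decoding steps.

    past_key_values: list of (key, value) pairs, each of shape
                     (batch, num_heads, seq_len, head_dim)
    kept_positions:  1-D LongTensor of token positions to keep
    start_layer:     layers >= start_layer use the pruned cache
    """
    pruned = []
    for layer_idx, (key, value) in enumerate(past_key_values):
        if layer_idx < start_layer:
            # Shallow layers keep the full cache.
            pruned.append((key, value))
        else:
            # Deeper layers keep only the selected positions.
            pruned.append((key[:, :, kept_positions, :],
                           value[:, :, kept_positions, :]))
    return pruned
```

Because the kept set is chosen once from prefill attention, later decoding steps cannot revisit tokens discarded at that point, which is the source of the divergence noted above.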