Research paper for efficient image/video understanding via large multimodal models
LLaVA-Mini is a unified large multimodal model (LMM) designed for efficient image, high-resolution image, and video understanding. It targets researchers and developers seeking to reduce computational overhead and latency in vision-language tasks, delivering large efficiency gains through a novel single-token representation of visual input.
How It Works
LLaVA-Mini employs a single vision token to represent entire images or video frames, drastically reducing computational load compared to traditional methods that use multiple tokens. This approach, guided by interpretability insights into LMM visual token processing, achieves a 77% FLOPs reduction, lowers response latency to 40ms, and minimizes VRAM usage to 0.6MB/image, enabling the processing of long videos on limited hardware.
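To make the single-token idea concrete, the following is a minimal, illustrative PyTorch sketch (not the released LLaVA-Mini code) of query-based pooling that collapses all vision patch embeddings into one token before the language model sees them; the class name, dimensions, and use of `nn.MultiheadAttention` are assumptions for illustration.

```python
# Minimal sketch: compress N vision patch embeddings into a single vision token
# with one learnable query and cross-attention (illustrative, not the paper's code).
import torch
import torch.nn as nn

class SingleTokenCompressor(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable query that aggregates information from every patch.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, dim), e.g. 576 CLIP patch embeddings
        query = self.query.expand(patch_embeds.size(0), -1, -1)
        # Cross-attention pools all patches into a single token: (batch, 1, dim)
        vision_token, _ = self.attn(query, patch_embeds, patch_embeds)
        return vision_token

# Example: 576 patch embeddings per image -> 1 vision token fed to the LLM
compressor = SingleTokenCompressor(dim=1024)
patches = torch.randn(2, 576, 1024)
print(compressor(patches).shape)  # torch.Size([2, 1, 1024])
```

Handing the language model one vision token per image instead of hundreds is what drives the FLOPs, latency, and VRAM reductions cited above.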
Quick Start & Requirements
Install with `pip install -e .` (and `pip install -e ".[train]"` for training) inside a conda environment (`conda create -n llavamini python=3.10 -y`). Inference expects a CUDA GPU (selected via `CUDA_VISIBLE_DEVICES`) and `flash-attn`. Run inference with `python llavamini/eval/run_llava_mini.py`, passing `--image-file` or `--video-file`; see the example invocation below.
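The snippet below is an illustrative invocation sketch. Only the script path, `--image-file`, `--video-file`, and `--load-8bit` appear in this summary; `--model-path`, `--query`, and the checkpoint path are assumed LLaVA-style arguments and placeholders rather than confirmed flags.

```bash
# Illustrative sketch of an inference call; --model-path and --query are assumed
# LLaVA-style flags and the checkpoint path is a placeholder.
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path /path/to/llava-mini-checkpoint \
    --image-file examples/sample.jpg \
    --query "Describe this image."
# For video input, replace --image-file with --video-file; add --load-8bit on
# GPUs with less than 20 GB of VRAM.
```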
A `--load-8bit` option is available for GPUs with less than 20 GB of VRAM.

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is presented with an arXiv publication date of 2025, suggesting it may be experimental or pre-release. Licensing details required for commercial adoption are absent.