LLaVA-Mini by ictnlp

Research paper for efficient image/video understanding via large multimodal models

created 6 months ago
515 stars

Top 61.6% on sourcepulse

Project Summary

LLaVA-Mini is a unified large multimodal model (LMM) designed for efficient image, high-resolution image, and video understanding. It targets researchers and developers seeking to reduce computational overhead and latency in vision-language tasks, offering large efficiency gains via a novel single-token representation of visual input while maintaining accuracy competitive with LLaVA-v1.5.

How It Works

LLaVA-Mini employs a single vision token to represent entire images or video frames, drastically reducing computational load compared to traditional methods that use multiple tokens. This approach, guided by interpretability insights into LMM visual token processing, achieves a 77% FLOPs reduction, lowers response latency to 40ms, and minimizes VRAM usage to 0.6MB/image, enabling the processing of long videos on limited hardware.
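The single-token idea can be illustrated with a minimal attention-pooling sketch. This is not the official implementation: the learnable query, dimensions, and pooling scheme here are illustrative assumptions, showing only how 576 patch tokens could be collapsed into one vision token.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_one_token(patch_tokens, query):
    """Attention-pool all patch tokens into a single vision token.

    patch_tokens: (num_patches, dim) array of visual features
    query: (dim,) hypothetical learnable compression query
    """
    scores = patch_tokens @ query / np.sqrt(patch_tokens.shape[1])
    weights = softmax(scores)        # attention distribution over patches
    return weights @ patch_tokens    # weighted sum -> one (dim,) token

patches = rng.normal(size=(576, 1024))   # 576 tokens, as in LLaVA-v1.5
query = rng.normal(size=1024)            # assumed learnable query vector
token = compress_to_one_token(patches, query)
print(token.shape)                       # (1024,)
```

Feeding one token instead of 576 into the LLM backbone is what drives the FLOPs and latency reductions claimed above.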

Quick Start & Requirements

  • Install: pip install -e . and pip install -e ".[train]" within a conda environment (conda create -n llavamini python=3.10 -y).
  • Prerequisites: Python 3.10, CUDA (implied by CUDA_VISIBLE_DEVICES), flash-attn.
  • Demo: Launch controller, model worker, and Gradio server using provided Python commands.
  • Interaction: Use python llavamini/eval/run_llava_mini.py with --image-file or --video-file.
  • Resource Note: --load-8bit option available for VRAM < 20GB.
  • Docs: Evaluation.md for benchmarks.
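The setup bullets above can be collected into one session. The conda and pip commands are taken from the quick start; the flash-attn install flag, the Hugging Face model path, and the exact inference arguments are assumptions to verify against the repository's README.

```shell
# Environment setup (commands from the quick start)
conda create -n llavamini python=3.10 -y
conda activate llavamini
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation   # flag assumed; see repo

# Single-image inference (model path assumed; check the repo's README)
python llavamini/eval/run_llava_mini.py \
    --model-path ICTNLP/llava-mini-llama-3.1-8b \
    --image-file path/to/image.jpg \
    --load-8bit   # optional, for GPUs with < 20 GB VRAM
```

Swap `--image-file` for `--video-file` to run video understanding with the same script.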

Highlighted Details

  • Achieves performance comparable to LLaVA-v1.5 using only 1 vision token (0.17% of LLaVA-v1.5's 576 tokens).
  • Reduces FLOPs by 77% and response latency to 40ms.
  • Supports processing over 10,000 video frames on a 24GB GPU.
  • Dynamically compresses visual information, weighting the most informative image regions more heavily.
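The headline token ratio is simple arithmetic worth making explicit: one vision token against LLaVA-v1.5's 576 per image.

```python
# One vision token versus LLaVA-v1.5's 576 tokens per image.
ratio = 1 / 576
print(f"{ratio:.2%}")  # 0.17%
```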

Maintenance & Community

  • Built upon the LLaVA codebase.
  • Utilizes video instruction data from Video-ChatGPT and image instruction data from LLaVA-OneVision.
  • Contact: zhangshaolei20z@ict.ac.cn for questions.

Licensing & Compatibility

  • License not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The accompanying arXiv paper is dated 2025, so the codebase may still be experimental or pre-release. Licensing details required for commercial adoption are absent from the README.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 60 stars in the last 90 days
