Research paper for efficient image/video understanding via large multimodal models
LLaVA-Mini is a unified large multimodal model (LMM) designed for efficient image, high-resolution image, and video understanding. It targets researchers and developers seeking to reduce computational overhead and latency in vision-language tasks, delivering large efficiency gains through a novel single-token representation of visual input.
How It Works
LLaVA-Mini employs a single vision token to represent entire images or video frames, drastically reducing computational load compared to traditional methods that use multiple tokens. This approach, guided by interpretability insights into LMM visual token processing, achieves a 77% FLOPs reduction, lowers response latency to 40ms, and minimizes VRAM usage to 0.6MB/image, enabling the processing of long videos on limited hardware.
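To make the single-token idea concrete, the following is a minimal, illustrative PyTorch sketch (not the released LLaVA-Mini code) of query-based pooling that collapses all vision patch embeddings into one token before the language model sees them; the class name, dimensions, and use of `nn.MultiheadAttention` are assumptions for illustration.

```python
# Minimal sketch: compress N vision patch embeddings into a single vision token
# with one learnable query and cross-attention (illustrative, not the paper's code).
import torch
import torch.nn as nn

class SingleTokenCompressor(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable query that aggregates information from every patch.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, dim), e.g. 576 CLIP patch embeddings
        query = self.query.expand(patch_embeds.size(0), -1, -1)
        # Cross-attention pools all patches into a single token: (batch, 1, dim)
        vision_token, _ = self.attn(query, patch_embeds, patch_embeds)
        return vision_token

# Example: 576 patch embeddings per image -> 1 vision token fed to the LLM
compressor = SingleTokenCompressor(dim=1024)
patches = torch.randn(2, 576, 1024)
print(compressor(patches).shape)  # torch.Size([2, 1, 1024])
```

Handing the language model one vision token per image instead of hundreds is what drives the FLOPs, latency, and VRAM reductions cited above.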
Quick Start & Requirements
Install with `pip install -e .` (and `pip install -e ".[train]"` for training) inside a conda environment (`conda create -n llavamini python=3.10 -y`). Inference expects a CUDA GPU (selected via `CUDA_VISIBLE_DEVICES`) and `flash-attn`. Run inference with `python llavamini/eval/run_llava_mini.py`, passing `--image-file` or `--video-file`; see the example invocation below.
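The snippet below is an illustrative invocation sketch. Only the script path, `--image-file`, `--video-file`, and `--load-8bit` appear in this summary; `--model-path`, `--query`, and the checkpoint path are assumed LLaVA-style arguments and placeholders rather than confirmed flags.

```bash
# Illustrative sketch of an inference call; --model-path and --query are assumed
# LLaVA-style flags and the checkpoint path is a placeholder.
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path /path/to/llava-mini-checkpoint \
    --image-file examples/sample.jpg \
    --query "Describe this image."
# For video input, replace --image-file with --video-file; add --load-8bit on
# GPUs with less than 20 GB of VRAM.
```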
A `--load-8bit` option is available for GPUs with less than 20 GB of VRAM.

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is presented with an arXiv publication date of 2025, suggesting it may be experimental or pre-release. Licensing details required for commercial adoption are absent.