LLaVA-UHD by thunlp

Efficient native-resolution encoding for multimodal LLMs

Created 1 year ago
397 stars

Top 72.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLaVA-UHD-v3 addresses efficient native-resolution encoding in Multimodal Large Language Models (MLLMs). Its Progressive Visual Compression (PVC) approach drastically cuts inference latency (1.9x TTFT reduction) while matching state-of-the-art performance across 15 benchmarks. This offers researchers and power users high-fidelity vision-language capabilities with enhanced efficiency.

How It Works

The ViT-UHD encoder uses Progressive Visual Compression (PVC): Refined Patch Embedding (RPE) scales patches flexibly to the image's native resolution, and Windowed Token Compression (WTC) merges tokens within local windows. This shortens the visual token sequence and reduces computation while, unlike slice-based methods, preserving full-scene semantics and holistic understanding for efficient, high-fidelity vision-language tasks. A rough sketch of the windowed merging step follows.
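
The snippet below is a minimal sketch of the windowed-merging idea behind WTC, assuming it amounts to pooling visual tokens over non-overlapping local windows; the window size, the mean-pooling operator, and the function name are illustrative assumptions rather than the repo's actual implementation.

```python
import torch
import torch.nn.functional as F

def windowed_token_compression(tokens: torch.Tensor, h: int, w: int, window: int = 2) -> torch.Tensor:
    """Merge visual tokens within non-overlapping local windows.

    tokens: (batch, h*w, dim) patch embeddings laid out row-major on an h x w grid.
    Returns (batch, (h//window)*(w//window), dim).
    Mean-pooling is an assumption; the actual WTC operator may differ.
    """
    b, n, d = tokens.shape
    assert n == h * w, "token count must match the h*w grid"
    x = tokens.transpose(1, 2).reshape(b, d, h, w)            # to (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=window, stride=window)    # merge each local window
    return x.flatten(2).transpose(1, 2)                       # back to (B, N', D)

# Example: a 32x32 patch grid (1,024 tokens) compressed 4x to 256 tokens.
feats = torch.randn(1, 32 * 32, 1024)
compressed = windowed_token_compression(feats, 32, 32, window=2)
print(compressed.shape)  # torch.Size([1, 256, 1024])
```

Merging a 32x32 grid with a 2x2 window cuts the visual sequence from 1,024 to 256 tokens, illustrating how local compression shortens the sequence the language model must attend over.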

Quick Start & Requirements

Install via pip install "transformers>=4.51.0". For inference, torch with bfloat16 and flash_attention_2 is recommended (a hedged loading sketch follows below). Evaluation requires Conda (Python 3.10) and VLMEvalKit. Training needs Conda, flash_attn wheels, and pre-trained checkpoints (ViT-UHD, Qwen2-7B), and is resource-intensive: ~300 hours on 32 A100 GPUs. Hugging Face models and arXiv paper links are provided.
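
As a hedged illustration of those inference requirements, the sketch below loads a checkpoint with bfloat16 and flash_attention_2 via transformers; the repo id and the Auto* classes are assumptions, so check the project's Hugging Face page for the exact identifiers and official usage snippet.

```python
# Minimal loading sketch based on the stated requirements (transformers>=4.51.0,
# bfloat16, flash_attention_2). Repo id and classes below are assumptions.
import torch
from transformers import AutoModel, AutoProcessor

model_id = "thunlp/LLaVA-UHD-v3"  # hypothetical repo id; verify on Hugging Face

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                 # recommended precision
    attn_implementation="flash_attention_2",    # recommended attention backend
    trust_remote_code=True,
).to("cuda").eval()
```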

Highlighted Details

  • Comparable performance to Qwen2-VL across 15 benchmarks.
  • 1.9x reduction in Time-to-First-Token (TTFT).
  • Novel ViT-UHD encoder with PVC, RPE, and WTC.
  • Preserves holistic understanding and full-scene semantics.
  • LLaVA-UHD v2 showed an average 3.7% performance boost over the original LLaVA-UHD.

Maintenance & Community

Active development is evident from multiple versions (v1-v3) and academic acceptances (ECCV 2024, AAAI-26). However, the README lacks direct links to community channels (Discord, Slack) or a public roadmap.

Licensing & Compatibility

The README omits explicit license information. This is a critical adoption blocker, leaving usage rights and compatibility for commercial or closed-source applications undefined.

Limitations & Caveats

Training is computationally demanding (~300 hours on 32 A100 GPUs). Evaluation requires integrating an external tool (VLMEvalKit). The absence of a specified license in the README is a significant limitation, leaving usage rights unclear.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA

0.3%
2k
Suite of neural tokenizers for image and video processing
Created 1 year ago
Updated 9 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Zhiqiang Xie (Coauthor of SGLang), and 1 more.

Sana by NVlabs

0.6%
5k
Image synthesis research paper using a linear diffusion transformer
Created 1 year ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

0.2%
6k
Transformer library with extensive experimental features
Created 5 years ago
Updated 3 weeks ago