LLaVA-UHD by thunlp

Efficient native-resolution encoding for multimodal LLMs

Created 1 year ago
397 stars

Top 72.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLaVA-UHD-v3 addresses efficient native-resolution encoding in Multimodal Large Language Models (MLLMs). Its Progressive Visual Compression (PVC) approach drastically cuts inference latency (1.9x TTFT reduction) while matching state-of-the-art performance across 15 benchmarks. This offers researchers and power users high-fidelity vision-language capabilities with enhanced efficiency.

How It Works

The ViT-UHD encoder uses Progressive Visual Compression (PVC): Refined Patch Embedding (RPE) scales patches flexibly to the image's native resolution, and Windowed Token Compression (WTC) merges tokens within local windows. This shortens the visual token sequence and reduces computation while, unlike slice-based methods, preserving full-scene semantics and holistic understanding for efficient, high-fidelity vision-language tasks. A rough sketch of the windowed merging step follows.
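
The snippet below is a minimal sketch of the windowed-merging idea behind WTC, assuming it amounts to pooling visual tokens over non-overlapping local windows; the window size, the mean-pooling operator, and the function name are illustrative assumptions rather than the repo's actual implementation.

```python
import torch
import torch.nn.functional as F

def windowed_token_compression(tokens: torch.Tensor, h: int, w: int, window: int = 2) -> torch.Tensor:
    """Merge visual tokens within non-overlapping local windows.

    tokens: (batch, h*w, dim) patch embeddings laid out row-major on an h x w grid.
    Returns (batch, (h//window)*(w//window), dim).
    Mean-pooling is an assumption; the actual WTC operator may differ.
    """
    b, n, d = tokens.shape
    assert n == h * w, "token count must match the h*w grid"
    x = tokens.transpose(1, 2).reshape(b, d, h, w)            # to (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=window, stride=window)    # merge each local window
    return x.flatten(2).transpose(1, 2)                       # back to (B, N', D)

# Example: a 32x32 patch grid (1,024 tokens) compressed 4x to 256 tokens.
feats = torch.randn(1, 32 * 32, 1024)
compressed = windowed_token_compression(feats, 32, 32, window=2)
print(compressed.shape)  # torch.Size([1, 256, 1024])
```

Merging a 32x32 grid with a 2x2 window cuts the visual sequence from 1,024 to 256 tokens, illustrating how local compression shortens the sequence the language model must attend over.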

Quick Start & Requirements

Install via pip install "transformers>=4.51.0". For inference, torch with bfloat16 and flash_attention_2 is recommended (a hedged loading sketch follows below). Evaluation requires Conda (Python 3.10) and VLMEvalKit. Training needs Conda, flash_attn wheels, and pre-trained checkpoints (ViT-UHD, Qwen2-7B), and is resource-intensive: ~300 hours on 32 A100 GPUs. Hugging Face models and arXiv paper links are provided.
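
As a hedged illustration of those inference requirements, the sketch below loads a checkpoint with bfloat16 and flash_attention_2 via transformers; the repo id and the Auto* classes are assumptions, so check the project's Hugging Face page for the exact identifiers and official usage snippet.

```python
# Minimal loading sketch based on the stated requirements (transformers>=4.51.0,
# bfloat16, flash_attention_2). Repo id and classes below are assumptions.
import torch
from transformers import AutoModel, AutoProcessor

model_id = "thunlp/LLaVA-UHD-v3"  # hypothetical repo id; verify on Hugging Face

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                 # recommended precision
    attn_implementation="flash_attention_2",    # recommended attention backend
    trust_remote_code=True,
).to("cuda").eval()
```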

Highlighted Details

  • Comparable performance to Qwen2-VL across 15 benchmarks.
  • 1.9x reduction in Time-to-First-Token (TTFT).
  • Novel ViT-UHD encoder with PVC, RPE, and WTC.
  • Preserves holistic understanding and full-scene semantics.
  • LLaVA-UHD v2 showed an average 3.7% performance boost over the original LLaVA-UHD.

Maintenance & Community

Active development is evident from multiple versions (v1-v3) and academic acceptances (ECCV 2024, AAAI-26). However, the README lacks direct links to community channels (Discord, Slack) or a public roadmap.

Licensing & Compatibility

The README omits explicit license information. This is a critical adoption blocker, leaving usage rights and compatibility for commercial or closed-source applications undefined.

Limitations & Caveats

Training is computationally demanding (~300 hours on 32 A100 GPUs). Evaluation requires integrating an external tool (VLMEvalKit). The absence of a specified license in the README is a significant limitation, leaving usage rights unclear.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA

0.3%
2k
Suite of neural tokenizers for image and video processing
Created 1 year ago
Updated 9 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Zhiqiang Xie (Coauthor of SGLang), and 1 more.

Sana by NVlabs

0.6%
5k
Image synthesis research paper using a linear diffusion transformer
Created 1 year ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

0.2%
6k
Transformer library with extensive experimental features
Created 5 years ago
Updated 3 weeks ago