thunlp: Efficient native-resolution encoding for multimodal LLMs
Summary
LLaVA-UHD-v3 tackles efficient native-resolution encoding for Multimodal Large Language Models (MLLMs). Its Progressive Visual Compression (PVC) approach sharply cuts inference latency (1.9x reduction in time-to-first-token, TTFT) while matching state-of-the-art performance across 15 benchmarks, giving researchers and power users high-fidelity vision-language capability at lower cost.
How It Works
The ViT-UHD encoder uses Progressive Visual Compression (PVC). PVC combines Refined Patch Embedding (RPE) for flexible patch scaling with Windowed Token Compression (WTC), which merges local tokens. Together these reduce sequence length and computation while preserving full-scene semantics; unlike slice-based methods, the image is encoded holistically, enabling efficient, high-fidelity vision-language understanding.
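The token-merging idea behind WTC can be sketched in a few lines. This is an illustrative simplification, not the project's implementation: it mean-pools each small spatial window of patch tokens into a single token, shrinking the sequence by the window area (the function name and mean-pooling choice are assumptions for the sketch).

```python
import numpy as np

def windowed_token_compression(tokens, grid_h, grid_w, window=2):
    """Illustrative sketch of windowed token compression: merge each
    window x window block of patch tokens into one token by mean pooling,
    reducing the sequence length by a factor of window**2."""
    n, d = tokens.shape
    assert n == grid_h * grid_w, "tokens must form a full patch grid"
    assert grid_h % window == 0 and grid_w % window == 0
    # Reshape the flat token sequence back into its 2D patch grid,
    # splitting each spatial axis into (windows, within-window) parts.
    x = tokens.reshape(grid_h // window, window, grid_w // window, window, d)
    # Average over the two intra-window axes -> one token per window.
    return x.mean(axis=(1, 3)).reshape(-1, d)

# A 32x32 patch grid (1024 tokens, dim 8) compresses to 256 tokens.
tokens = np.random.rand(32 * 32, 8)
out = windowed_token_compression(tokens, 32, 32)
print(out.shape)  # (256, 8)
```

A learned projection or attention-based merge could replace the mean pool; the sequence-length reduction, which drives the TTFT savings, is the same either way.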
Quick Start & Requirements
Install via pip install "transformers>=4.51.0". For inference, torch with bfloat16 and flash_attention_2 is recommended. Evaluation requires Conda (Python 3.10) and VLMEvalKit. Training needs Conda, flash_attn wheels, and pre-trained checkpoints (ViT-UHD, Qwen2-7B), and is resource-intensive: ~300 hours on 32 A100 GPUs. Hugging Face model and arXiv paper links are provided.
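The recommended inference settings (bfloat16 weights, FlashAttention-2) map onto the standard Transformers loading call roughly as below. This is a hedged sketch: the exact model ID, the use of AutoModelForCausalLM/AutoProcessor, and trust_remote_code are assumptions; consult the repository's README for the actual entry point.

```python
def load_llava_uhd(model_path):
    """Load a checkpoint with the README-recommended settings.
    `model_path` is a placeholder; substitute the Hugging Face model ID
    published in the repository."""
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,            # halve memory vs fp32
        attn_implementation="flash_attention_2",  # requires flash-attn wheels
        trust_remote_code=True,  # assumption: custom encoder code in the repo
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
    return model, processor
```

flash_attention_2 needs a CUDA GPU and a matching flash-attn wheel; without one, dropping the attn_implementation argument falls back to the default attention kernel.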
Highlighted Details
Maintenance & Community
Active development is evident through multiple versions (v1-v3) and academic acceptances (ECCV2024, AAAI-26). However, the README lacks direct links to community channels (Discord, Slack) or a public roadmap.
Licensing & Compatibility
The README omits explicit license information. This is a critical adoption blocker, leaving usage rights and compatibility for commercial or closed-source applications undefined.
Limitations & Caveats
Training is computationally demanding (~300 hours on 32 A100 GPUs), and evaluation depends on an external tool (VLMEvalKit). As noted above, the missing license leaves usage rights unclear.