Multimodal LLM for vision-language tasks, document parsing, and agent functionality
Top 4.3% on sourcepulse
Qwen2.5-VL is a series of multimodal large language models for advanced vision-language understanding. It targets researchers and developers who need to process and reason over complex visual data, including documents, objects, and videos, and offers enhanced capabilities over its predecessor, Qwen2-VL.
How It Works
Qwen2.5-VL integrates a streamlined ViT with window attention, SwiGLU, and RMSNorm for efficient visual encoding. It employs dynamic FPS sampling and mRoPE with temporal IDs for ultra-long video understanding. The model supports precise object grounding via absolute coordinates and JSON formats, and its agent functionality is enhanced for computer and mobile device interactions.
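Because grounding results come back as plain JSON in the model's reply, they can be post-processed directly. Below is a minimal parsing sketch, assuming a reply shaped like the project's grounding examples; the bbox_2d and label keys and the fence-stripping regex are assumptions, not a guaranteed schema:

```python
import json
import re

def parse_grounding(response: str):
    """Extract (label, (x1, y1, x2, y2)) pairs from a grounding reply."""
    # The model may wrap the JSON in a ```json fence; grab the outermost list.
    match = re.search(r"\[.*\]", response, re.DOTALL)
    if match is None:
        return []
    objects = json.loads(match.group(0))
    # bbox_2d holds absolute pixel coordinates (x1, y1, x2, y2),
    # not 0-1000 normalized values as in some earlier VL models.
    return [(obj["label"], tuple(obj["bbox_2d"])) for obj in objects]

print(parse_grounding('[{"bbox_2d": [10, 20, 110, 220], "label": "dog"}]'))
# [('dog', (10, 20, 110, 220))]
```

Since the coordinates are absolute pixels, the boxes can be drawn on the original image without rescaling.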
Quick Start & Requirements
Install the development build of transformers along with accelerate and the Qwen vision utilities:

pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]

Use torchcodec for video decoding.
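With those packages in place, a minimal single-image inference sketch following the quick-start pattern on the model card; the model ID (Qwen/Qwen2.5-VL-7B-Instruct) and the demo image URL are illustrative choices:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Render the chat template, collect the vision inputs, and tokenize.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens before decoding.
generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```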
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The decord backend for video processing has known issues and is not actively maintained; torchcodec is recommended. FlashAttention-2, if enabled, requires the model to be loaded in half precision (float16 or bfloat16).
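Given that caveat, a load-time sketch that satisfies the half-precision requirement; it assumes the flash-attn package is installed, and the 7B Instruct checkpoint is an illustrative choice:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# bfloat16 satisfies FlashAttention-2's fp16/bf16 requirement.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```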