Discover and explore top open-source AI tools and projects—updated daily.
QwenLM: Multimodal LLM for vision-language tasks, document parsing, and agent functionality
Top 2.6% on SourcePulse
Qwen2.5-VL is a series of multimodal large language models designed for advanced vision-language understanding tasks. It targets researchers and developers needing to process and reason about complex visual data, including documents, objects, and videos, offering enhanced capabilities over its predecessor.
How It Works
Qwen2.5-VL integrates a streamlined ViT with window attention, SwiGLU, and RMSNorm for efficient visual encoding. It employs dynamic FPS sampling and mRoPE with temporal IDs for ultra-long video understanding. The model supports precise object grounding via absolute coordinates and JSON formats, and its agent functionality is enhanced for computer and mobile device interactions.
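The grounding interface mentioned above returns detections as JSON with absolute pixel coordinates. A minimal parsing sketch, assuming the `bbox_2d`/`label` field convention used in the project's demos (the raw string here is an illustrative, hand-written response, not real model output):

```python
import json

# Illustrative grounding reply: absolute pixel coordinates, not normalized.
# Field names (bbox_2d, label) follow the convention seen in Qwen2.5-VL
# demos and are an assumption here.
raw = '[{"bbox_2d": [112, 64, 480, 360], "label": "dog"}]'

detections = json.loads(raw)
for det in detections:
    x1, y1, x2, y2 = det["bbox_2d"]  # top-left and bottom-right corners
    print(f'{det["label"]}: ({x1}, {y1}) -> ({x2}, {y2})')
```

Because the coordinates are absolute rather than normalized, they can be drawn directly onto the original image without rescaling.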
Quick Start & Requirements
Install the latest Transformers from source along with Accelerate, then the Qwen vision utilities (the decord extra handles video decoding; torchcodec is the recommended alternative backend):

pip install git+https://github.com/huggingface/transformers accelerate
pip install "qwen-vl-utils[decord]"

Highlighted Details
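With the dependencies installed, a minimal inference sketch might look like the following. The model name `Qwen/Qwen2.5-VL-7B-Instruct` and the `process_vision_info` helper follow the project's published quick-start; the image path is a placeholder, and the `RUN_MODEL` flag simply keeps the heavy download out of a dry run:

```python
RUN_MODEL = False  # flip to True once the pip installs above have been run


def build_messages(image: str, prompt: str) -> list:
    """Assemble a Qwen2.5-VL chat payload: one user turn mixing image and text."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }]


if RUN_MODEL:
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

    messages = build_messages("path/to/image.jpg", "Describe this image.")
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Strip the prompt tokens before decoding the reply.
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same message structure works for video inputs by using a `{"type": "video", "video": ...}` content entry instead of the image entry.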
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The decord backend for video processing has known issues and is not actively maintained; torchcodec is recommended instead. For inference, load the model in half precision (float16 or bfloat16).
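When more than one video backend is installed, qwen-vl-utils reportedly lets you force a specific reader through an environment variable. A small sketch, assuming the `FORCE_QWENVL_VIDEO_READER` variable name from the utility's documentation:

```python
import os

# Assumed knob from qwen-vl-utils: pin the video backend explicitly
# ("torchcodec", "decord", or "torchvision") instead of auto-selection.
# This must be set before qwen_vl_utils is imported.
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchcodec"
```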