Qwen2.5-VL by QwenLM

Multimodal LLM for vision-language tasks, document parsing, and agent functionality

created 11 months ago
11,849 stars

Top 4.3% on sourcepulse

Project Summary

Qwen2.5-VL is a series of multimodal large language models designed for advanced vision-language understanding. It targets researchers and developers who need to process and reason about complex visual data, including documents, objects, and videos, and offers enhanced capabilities over its predecessor, Qwen2-VL.

How It Works

Qwen2.5-VL integrates a streamlined vision transformer (ViT) with window attention, SwiGLU, and RMSNorm for efficient visual encoding. For ultra-long video understanding, it employs dynamic FPS sampling and multimodal rotary position embedding (mRoPE) with temporal IDs. The model supports precise object grounding via absolute coordinates and JSON output (illustrated below), and its agent functionality is extended to computer and mobile device interactions.
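
A hedged illustration of that grounding output, assuming the bbox_2d/label JSON schema used in Qwen's cookbooks; the prompt wording, coordinates, and labels below are placeholders, not real model output:

    # Ask the model to localize objects; Qwen2.5-VL answers with absolute
    # pixel coordinates rather than normalized ones.
    prompt = "Locate every person in the image and report bbox coordinates in JSON format."

    # Expected shape of the reply (values are illustrative):
    expected_reply = [
        {"bbox_2d": [135, 40, 290, 410], "label": "person"},  # [x1, y1, x2, y2] in pixels
        {"bbox_2d": [402, 55, 560, 430], "label": "person"},
    ]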

Quick Start & Requirements

  • Install the latest Transformers from source plus Accelerate: pip install git+https://github.com/huggingface/transformers accelerate.
  • Install the helper library: pip install "qwen-vl-utils[decord]" (quote the extra in shells that expand brackets).
  • Recommended: torchcodec for video decoding.
  • Supports Hugging Face Transformers and ModelScope.
  • See the Cookbooks for detailed examples; a minimal inference sketch follows this list.
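
A minimal single-image inference sketch following the standard Transformers pattern for this model family; the 7B-Instruct checkpoint and the image URL are stand-ins for whatever you actually use:

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # example checkpoint
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }]

    # Render the chat template and gather the vision inputs it references.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)

    out = model.generate(**inputs, max_new_tokens=128)
    # Drop the prompt tokens before decoding the generated answer.
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])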

Highlighted Details

  • Advanced omnidocument parsing for multilingual and multi-scene documents.
  • Precise object detection, pointing, and counting with coordinate and JSON support.
  • Ultra-long video understanding (hours-long input) and fine-grained video grounding; a video usage sketch follows this list.
  • Enhanced agent capabilities for device control.
  • Benchmarks show competitive performance against leading multimodal models like Gemini-2 Flash and GPT-4o.
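
A hedged sketch of video input, reusing the same pattern as the image example above; the file path is a placeholder, and the fps key (read by qwen-vl-utils to steer dynamic frame sampling) can vary between library versions, so consult the Cookbooks for the current keyword arguments:

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            # Placeholder path; "fps" controls how densely frames are sampled.
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the video and note when the speaker stands up."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])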

Maintenance & Community

  • Active development by the Qwen team, Alibaba Cloud.
  • Links to Hugging Face, ModelScope, blog, Discord, and API documentation are provided.
  • Fine-tuning code for Qwen2-VL and Qwen2.5-VL is available.

Licensing & Compatibility

  • The README does not state a single license for the released models; weights are distributed on Hugging Face and ModelScope under per-checkpoint licenses. Verify the license of the specific checkpoint, especially for commercial use, before deploying.

Limitations & Caveats

  • Using YaRN to extend the context for long text can degrade temporal and spatial localization performance.
  • The decord backend for video decoding has known issues and is no longer actively maintained; torchcodec is recommended instead.
  • FlashAttention-2 requires compatible hardware and must be used with float16 or bfloat16 weights; a loading sketch follows this list.
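
A minimal sketch of enabling FlashAttention-2 at load time, assuming the flash-attn package is installed and the GPU supports it:

    import torch
    from transformers import Qwen2_5_VLForConditionalGeneration

    # FlashAttention-2 only runs in fp16/bf16; bf16 is the usual choice on
    # Ampere-or-newer GPUs.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )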

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 70

Star History

  • 1,783 stars in the last 90 days
