Qwen2.5-VL by QwenLM

Multimodal LLM for vision-language tasks, document parsing, and agent functionality

created 11 months ago
11,849 stars

Top 4.3% on sourcepulse

Project Summary

Qwen2.5-VL is a series of multimodal large language models designed for advanced vision-language understanding. It targets researchers and developers who need to process and reason about complex visual data, including documents, objects, and videos, and offers enhanced capabilities over its predecessor, Qwen2-VL.

How It Works

Qwen2.5-VL integrates a streamlined vision transformer (ViT) with window attention, SwiGLU, and RMSNorm for efficient visual encoding. For ultra-long video understanding, it employs dynamic FPS sampling and multimodal rotary position embedding (mRoPE) with temporal IDs. The model supports precise object grounding via absolute coordinates and JSON output (illustrated below), and its agent functionality is extended to computer and mobile device interactions.
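
A hedged illustration of that grounding output, assuming the bbox_2d/label JSON schema used in Qwen's cookbooks; the prompt wording, coordinates, and labels below are placeholders, not real model output:

    # Ask the model to localize objects; Qwen2.5-VL answers with absolute
    # pixel coordinates rather than normalized ones.
    prompt = "Locate every person in the image and report bbox coordinates in JSON format."

    # Expected shape of the reply (values are illustrative):
    expected_reply = [
        {"bbox_2d": [135, 40, 290, 410], "label": "person"},  # [x1, y1, x2, y2] in pixels
        {"bbox_2d": [402, 55, 560, 430], "label": "person"},
    ]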

Quick Start & Requirements

  • Install the latest Transformers from source plus Accelerate: pip install git+https://github.com/huggingface/transformers accelerate.
  • Install the helper library: pip install "qwen-vl-utils[decord]" (quote the extra in shells that expand brackets).
  • Recommended: torchcodec for video decoding.
  • Supports Hugging Face Transformers and ModelScope.
  • See the Cookbooks for detailed examples; a minimal inference sketch follows this list.
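
A minimal single-image inference sketch following the standard Transformers pattern for this model family; the 7B-Instruct checkpoint and the image URL are stand-ins for whatever you actually use:

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # example checkpoint
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }]

    # Render the chat template and gather the vision inputs it references.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)

    out = model.generate(**inputs, max_new_tokens=128)
    # Drop the prompt tokens before decoding the generated answer.
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])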

Highlighted Details

  • Advanced omnidocument parsing for multilingual and multi-scene documents.
  • Precise object detection, pointing, and counting with coordinate and JSON support.
  • Ultra-long video understanding (hours-long input) and fine-grained video grounding; a video usage sketch follows this list.
  • Enhanced agent capabilities for device control.
  • Benchmarks show competitive performance against leading multimodal models like Gemini-2 Flash and GPT-4o.
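
A hedged sketch of video input, reusing the same pattern as the image example above; the file path is a placeholder, and the fps key (read by qwen-vl-utils to steer dynamic frame sampling) can vary between library versions, so consult the Cookbooks for the current keyword arguments:

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            # Placeholder path; "fps" controls how densely frames are sampled.
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the video and note when the speaker stands up."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])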

Maintenance & Community

  • Active development by the Qwen team, Alibaba Cloud.
  • Links to Hugging Face, ModelScope, blog, Discord, and API documentation are provided.
  • Fine-tuning code for Qwen2-VL and Qwen2.5-VL is available.

Licensing & Compatibility

  • The README does not state a single license for the released models; weights are distributed on Hugging Face and ModelScope under per-checkpoint licenses. Verify the license of the specific checkpoint, especially for commercial use, before deploying.

Limitations & Caveats

  • Using YaRN to extend the context for long text can degrade temporal and spatial localization performance.
  • The decord backend for video decoding has known issues and is no longer actively maintained; torchcodec is recommended instead.
  • FlashAttention-2 requires compatible hardware and must be used with float16 or bfloat16 weights; a loading sketch follows this list.
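
A minimal sketch of enabling FlashAttention-2 at load time, assuming the flash-attn package is installed and the GPU supports it:

    import torch
    from transformers import Qwen2_5_VLForConditionalGeneration

    # FlashAttention-2 only runs in fp16/bf16; bf16 is the usual choice on
    # Ampere-or-newer GPUs.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )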

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 70

Star History

  • 1,783 stars in the last 90 days
