Qwen2.5-VL by QwenLM

Multimodal LLM for vision-language tasks, document parsing, and agent functionality

Created 1 year ago
12,528 stars

Top 4.0% on SourcePulse

View on GitHub
Project Summary

Qwen2.5-VL is a series of multimodal large language models designed for advanced vision-language understanding tasks. It targets researchers and developers needing to process and reason about complex visual data, including documents, objects, and videos, offering enhanced capabilities over its predecessor.

How It Works

Qwen2.5-VL integrates a streamlined ViT with window attention, SwiGLU, and RMSNorm for efficient visual encoding. It employs dynamic FPS sampling and mRoPE with temporal IDs for ultra-long video understanding. The model supports precise object grounding via absolute coordinates and JSON formats, and its agent functionality is enhanced for computer and mobile device interactions.
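As an illustration of the grounding output, detections come back as a JSON list with absolute pixel coordinates. The bbox_2d and label field names below follow the repository's cookbooks; the boxes themselves are made-up placeholders:

```python
import json

# Hypothetical model reply to a "detect every person, output JSON" prompt;
# coordinates are absolute pixels (top-left x1, y1 and bottom-right x2, y2).
example_reply = """[
  {"bbox_2d": [672, 41, 923, 698], "label": "person"},
  {"bbox_2d": [112, 85, 401, 710], "label": "person"}
]"""

for obj in json.loads(example_reply):
    x1, y1, x2, y2 = obj["bbox_2d"]
    print(f'{obj["label"]}: ({x1}, {y1}) -> ({x2}, {y2})')
```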

Quick Start & Requirements

  • Install via pip install git+https://github.com/huggingface/transformers accelerate and pip install qwen-vl-utils[decord].
  • Recommended: torchcodec for video decoding.
  • Supports Hugging Face Transformers and ModelScope.
  • See Cookbooks for detailed examples; a minimal inference sketch follows below.
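A minimal single-image inference sketch following the Transformers quick start; the checkpoint ID and image URL are placeholders, so adjust them for your setup:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/demo.jpg"},  # placeholder URL
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the chat-formatted prompt and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```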

Highlighted Details

  • Advanced omni-document parsing for multilingual and multi-scene documents.
  • Precise object detection, pointing, and counting with coordinate and JSON support.
  • Ultra-long video understanding (hours) and fine-grained video grounding (a video inference sketch follows this list).
  • Enhanced agent capabilities for device control.
  • Benchmarks show competitive performance against leading multimodal models like Gemini-2 Flash and GPT-4o.
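A sketch of the video path, loading the model the same way as in the quick-start example; the file path, fps, and max_pixels values are placeholders, and newer qwen-vl-utils releases may expect additional video kwargs, so consult the cookbooks if decoding fails:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# A local video referenced directly in the message; fps and max_pixels
# control frame sampling and per-frame resolution (placeholder values).
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4",
         "max_pixels": 360 * 420, "fps": 1.0},
        {"type": "text", "text": "Summarize this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```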

Maintenance & Community

  • Active development by the Qwen team, Alibaba Cloud.
  • Links to Hugging Face, ModelScope, blog, Discord, and API documentation are provided.
  • Fine-tuning code for Qwen2-VL and Qwen2.5-VL is available.

Licensing & Compatibility

  • The README does not explicitly state a repository-wide license; model weights are distributed on Hugging Face and ModelScope under per-checkpoint licenses. Verify the applicable license, especially for commercial use, before adopting a specific checkpoint.

Limitations & Caveats

  • Using YaRN for long text processing may negatively impact temporal and spatial localization tasks.
  • decord backend for video processing has known issues and is not actively maintained; torchcodec is recommended.
  • FlashAttention-2 requires compatible hardware and loading the model in torch.float16 or torch.bfloat16 (see the loading snippet below).
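For illustration, a minimal loading snippet with FlashAttention-2 enabled, assuming a compatible GPU and an installed flash-attn package (the checkpoint ID is a placeholder):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# FlashAttention-2 only supports half precision, so set the dtype explicitly.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",           # placeholder checkpoint ID
    torch_dtype=torch.bfloat16,              # or torch.float16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```
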
Health Check

Last Commit: 4 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 38

Star History

444 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), Elvis Saravia (Founder of DAIR.AI), and 7 more.

CogVLM by zai-org

Top 0.0% on SourcePulse
7k stars
VLM for image understanding and multi-turn dialogue
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

Top 0.1% on SourcePulse
5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago