QwenLM
Vision-language model for multimodal understanding, localization, and text reading
Top 8.1% on SourcePulse
Qwen-VL is an open-source vision-language model from Alibaba Cloud designed for multimodal understanding. It accepts image and text inputs, outputs text and bounding boxes, and excels at fine-grained image analysis, text recognition, and multi-image conversations.
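For a concrete sense of the grounding output, the illustrative snippet below shows the kind of box-annotated response the model returns; the <ref>/<box> tag format and 0-1000 normalized coordinate scale follow the project's published examples, while the object and numbers here are made up.

# Illustrative only: a grounding-style answer returned as plain text.
# Box coordinates are on a 0-1000 normalized scale; the values are invented.
response = "The dog is on the left: <ref>the dog</ref><box>(221,180),(568,610)</box>"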
How It Works
Qwen-VL couples a Qwen-7B base LLM with an OpenCLIP ViT-bigG visual encoder, connected via a cross-attention adapter layer. This architecture enables fine-grained recognition and understanding: images are processed at 448x448 resolution, higher than the 224x224 typical of comparable models. The Qwen-VL-Chat variant is further aligned for conversational use.
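As a rough sketch of that connector (not the repository's actual code; the class name, query count, and the 1664/4096 hidden sizes are assumptions), a single cross-attention layer can compress variable-length patch features into a fixed set of visual tokens that are spliced into the LLM's input sequence:

import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    # Hypothetical sketch: compresses variable-length ViT patch features into a
    # fixed number of visual tokens via one cross-attention layer.
    def __init__(self, num_queries=256, vis_dim=1664, llm_dim=4096, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj = nn.Linear(vis_dim, llm_dim)            # map ViT width to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_feats):                        # (B, N_patches, vis_dim)
        kv = self.proj(patch_feats)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        visual_tokens, _ = self.attn(q, kv, kv)            # (B, num_queries, llm_dim)
        return visual_tokens                               # fed to the LLM alongside text tokens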
Quick Start & Requirements
pip install -r requirements.txt
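With the requirements installed, a single chat round with the conversational checkpoint can look like the following minimal sketch, assuming the Qwen/Qwen-VL-Chat weights on Hugging Face and the custom from_list_format/chat helpers loaded via trust_remote_code; the image path and question are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the chat-aligned checkpoint; trust_remote_code pulls in the model's custom classes.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# Build a multimodal query from an image and a question (both are placeholders).
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},
    {"text": "Describe the image and locate the dog."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # grounded answers include <ref>/<box> spans that can be parsed into boxes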
Highlighted Details
Maintenance & Community
Last activity was about 1 year ago; the project is currently marked Inactive.
Licensing & Compatibility
Limitations & Caveats