Qwen-VL by QwenLM

Vision-language model for multimodal understanding, localization, and text reading

created 1 year ago
6,129 stars

Top 8.6% on sourcepulse

View on GitHub
Project Summary

Qwen-VL is an open-source vision-language model from Alibaba Cloud, designed for multimodal understanding tasks. It accepts interleaved image and text inputs, outputs text and bounding boxes, and excels at detailed image analysis, text recognition, and multi-image conversations.

How It Works

Qwen-VL pairs a Qwen-7B base LLM with an OpenCLIP ViT-bigG visual encoder, connected via a cross-attention adapter layer. This architecture enables fine-grained recognition and understanding: images are processed at 448x448 resolution, higher than the 224x224 typical of comparable models. The Qwen-VL-Chat variant is further aligned for conversational use.
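A minimal PyTorch sketch of this style of adapter, using learnable queries that cross-attend over the visual features. All dimensions and names here are illustrative assumptions, not Qwen-VL's actual implementation:

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Hypothetical cross-attention adapter: compresses a variable number of
    ViT patch features into a fixed set of visual tokens for the LLM.
    Dimensions below are illustrative, not Qwen-VL's actual configuration."""

    def __init__(self, vit_dim=1664, llm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        # A fixed set of learnable query vectors attends over the image features.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj = nn.Linear(vit_dim, llm_dim)  # map ViT width to LLM width
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, vit_features):
        # vit_features: (batch, num_patches, vit_dim), e.g. from a 448x448 image
        kv = self.proj(vit_features)
        q = self.queries.unsqueeze(0).expand(vit_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        # (batch, num_queries, llm_dim): fixed-length visual tokens fed to the LLM
        return out
```

The design choice to compress patch features into a fixed number of query tokens keeps the LLM's sequence length bounded regardless of image resolution.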

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, PyTorch 1.12+ (2.0+ recommended), CUDA 11.4+ (for GPU).
  • Usage: Examples provided for 🤗 Transformers and 🤖 ModelScope; see the inference sketch after this list.
  • Docs: TUTORIAL.md
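A condensed 🤗 Transformers inference example in the style of the repository's quick start. The image URL is a placeholder, and trust_remote_code=True is required because Qwen-VL's chat interface ships with the model weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# Interleave images and text; the custom tokenizer builds the multimodal prompt.
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},  # placeholder image URL
    {"text": "What is in the picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```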

Highlighted Details

  • Qwen-VL-Max and Qwen-VL-Plus models achieve performance on par with Gemini Ultra and GPT-4V on multiple text-image multimodal tasks.
  • Qwen-VL-Max outperforms GPT-4V and Gemini on Chinese question answering and text comprehension.
  • Achieves state-of-the-art results on benchmarks such as MME, SEED-Bench, and multiple referring expression comprehension tasks.
  • Supports fine-tuning via full-parameter, LoRA, and Q-LoRA methods.
  • Offers an Int4 quantized version (Qwen/Qwen-VL-Chat-Int4) with improved inference speed and reduced memory usage; see the loading sketch after this list.
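Loading the quantized checkpoint follows the same Transformers pattern; this sketch assumes the GPTQ runtime dependencies (e.g. auto-gptq, optimum) are installed per the repository's instructions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes GPTQ dependencies (auto-gptq, optimum) are installed in addition
# to the base requirements; see the repository for exact versions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4", device_map="cuda", trust_remote_code=True
).eval()

# Inference then works exactly as with the full-precision model (model.chat, ...).
```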

Maintenance & Community

  • Active development by Alibaba Cloud.
  • Links to Web UI, APP, API, WeChat, Discord, and paper available.

Licensing & Compatibility

  • Released under Alibaba's Tongyi Qianwen license, which permits both research and commercial use.

Limitations & Caveats

  • Qwen-VL was not explicitly trained on Chinese grounding data, although it generalizes to such tasks zero-shot.
  • Q-LoRA fine-tuning requires the quantized (Int4) model as its base and does not support merging the trained adapters into a standalone model.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 330 stars in the last 90 days
