Qwen-VL by QwenLM

Vision-language model for multimodal understanding, localization, and text reading

created 1 year ago
6,129 stars

Top 8.6% on sourcepulse

View on GitHub
Project Summary

Qwen-VL is an open-source vision-language model from Alibaba Cloud, designed for multimodal understanding tasks. It accepts interleaved image and text inputs, outputs text and bounding boxes, and excels at detailed image analysis, text recognition, and multi-image conversations.

How It Works

Qwen-VL pairs a Qwen-7B base LLM with an OpenCLIP ViT-bigG visual encoder, connected via a cross-attention adapter layer. This architecture enables fine-grained recognition and understanding: images are processed at 448x448 resolution, higher than the 224x224 typical of comparable models. The Qwen-VL-Chat variant is further aligned for conversational use.
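A minimal PyTorch sketch of this style of adapter, using learnable queries that cross-attend over the visual features. All dimensions and names here are illustrative assumptions, not Qwen-VL's actual implementation:

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Hypothetical cross-attention adapter: compresses a variable number of
    ViT patch features into a fixed set of visual tokens for the LLM.
    Dimensions below are illustrative, not Qwen-VL's actual configuration."""

    def __init__(self, vit_dim=1664, llm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        # A fixed set of learnable query vectors attends over the image features.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj = nn.Linear(vit_dim, llm_dim)  # map ViT width to LLM width
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, vit_features):
        # vit_features: (batch, num_patches, vit_dim), e.g. from a 448x448 image
        kv = self.proj(vit_features)
        q = self.queries.unsqueeze(0).expand(vit_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        # (batch, num_queries, llm_dim): fixed-length visual tokens fed to the LLM
        return out
```

The design choice to compress patch features into a fixed number of query tokens keeps the LLM's sequence length bounded regardless of image resolution.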

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, PyTorch 1.12+ (2.0+ recommended), CUDA 11.4+ (for GPU).
  • Usage: Examples provided for 🤗 Transformers and 🤖 ModelScope; see the inference sketch after this list.
  • Docs: TUTORIAL.md
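A condensed 🤗 Transformers inference example in the style of the repository's quick start. The image URL is a placeholder, and trust_remote_code=True is required because Qwen-VL's chat interface ships with the model weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# Interleave images and text; the custom tokenizer builds the multimodal prompt.
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},  # placeholder image URL
    {"text": "What is in the picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```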

Highlighted Details

  • Qwen-VL-Max and Qwen-VL-Plus models achieve performance on par with Gemini Ultra and GPT-4V on multiple text-image multimodal tasks.
  • Qwen-VL-Max outperforms GPT-4V and Gemini on Chinese question answering and text comprehension.
  • Achieves state-of-the-art results on benchmarks such as MME, SEED-Bench, and multiple referring expression comprehension tasks.
  • Supports fine-tuning via full-parameter, LoRA, and Q-LoRA methods.
  • Offers an Int4 quantized version (Qwen/Qwen-VL-Chat-Int4) with improved inference speed and reduced memory usage; see the loading sketch after this list.
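Loading the quantized checkpoint follows the same Transformers pattern; this sketch assumes the GPTQ runtime dependencies (e.g. auto-gptq, optimum) are installed per the repository's instructions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes GPTQ dependencies (auto-gptq, optimum) are installed in addition
# to the base requirements; see the repository for exact versions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4", device_map="cuda", trust_remote_code=True
).eval()

# Inference then works exactly as with the full-precision model (model.chat, ...).
```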

Maintenance & Community

  • Active development by Alibaba Cloud.
  • Links to Web UI, APP, API, WeChat, Discord, and paper available.

Licensing & Compatibility

  • Released under Alibaba's Tongyi Qianwen license, which permits both research and commercial use.

Limitations & Caveats

  • Qwen-VL was not explicitly trained on Chinese grounding data, although it generalizes to such tasks zero-shot.
  • Q-LoRA fine-tuning requires the quantized (Int4) model as its base and does not support merging the trained adapters into a standalone model.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 330 stars in the last 90 days
