Qwen3-VL-Embedding  by QwenLM

State-of-the-art multimodal embedding and reranking for information retrieval

Created 3 days ago

456 stars

Top 66.2% on SourcePulse

View on GitHub
Project Summary

Qwen3-VL-Embedding and Qwen3-VL-Reranker provide state-of-the-art multimodal embedding and reranking, built on Qwen3-VL. They enable advanced information retrieval and cross-modal understanding by processing text, images, screenshots, and videos within a unified framework. With a shared representation space and precise reranking, these models improve retrieval accuracy across more than 30 languages.

How It Works

The Embedding model uses a dual-tower architecture to map diverse inputs into a high-dimensional semantic vector, suitable for efficient, large-scale retrieval. The Reranking model employs a single-tower architecture with Cross-Attention for deep inter-modal fusion, precisely scoring relevance for query-document pairs to refine initial recall. This tandem approach optimizes both recall and precision.
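The tandem recall-then-rerank flow described above can be sketched in a few lines, with the actual models stubbed out. The dual-tower stage compares independently computed vectors for fast recall; the single-tower reranker then scores each query-candidate pair jointly, represented here by a placeholder scoring function rather than the real cross-attention model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall_top_k(query_vec, doc_vecs, k=2):
    """Stage 1 (dual-tower): query and documents are embedded
    independently, so recall is a fast top-k similarity search."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

def rerank(query, candidates, score_fn):
    """Stage 2 (single-tower): each (query, candidate) pair is scored
    jointly; score_fn stands in for the cross-attention reranker."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
```

This mirrors the division of labor in the text: the embedding stage optimizes recall over a large corpus, while the reranker spends more compute per pair to sharpen precision on the shortlist.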

Quick Start & Requirements

Installation involves cloning the repository and running scripts/setup_environment.sh to set up dependencies. Models are available on Hugging Face and ModelScope, and usage examples cover both standard Transformers and vLLM integration. Specific hardware requirements (GPU, VRAM) are not documented, though the model sizes (2B, 8B) imply substantial GPU memory. Flash Attention 2 is recommended for acceleration.
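The setup steps above amount to a clone plus one script invocation; a minimal sketch follows, where the repository URL is assumed from the project and organization names rather than stated in this summary:

```shell
# Clone the repository (URL assumed from the QwenLM org and project name)
git clone https://github.com/QwenLM/Qwen3-VL-Embedding.git
cd Qwen3-VL-Embedding

# Install dependencies via the provided setup script
bash scripts/setup_environment.sh
```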

Highlighted Details

  • Multimodal Versatility: Handles text, images, screenshots, and video inputs for tasks like image-text retrieval and VQA.
  • Unified Representation: Generates semantically rich vectors in a shared space for cross-modal similarity.
  • High-Precision Reranking: Precisely scores relevance for arbitrary single or mixed-modal query-document pairs.
  • Global Applicability: Supports over 30 languages, with customizable instructions and flexible vector dimensions (MRL).
  • Efficiency: Quantization support is available for deployment.
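Flexible vector dimensions via MRL typically mean the leading components of an embedding remain usable on their own, so storage and search cost can be traded against accuracy. A minimal sketch of that truncation, where the function name and the renormalization step are illustrative rather than the project's API:

```python
import math

def truncate_embedding(vec, dim):
    """MRL-style truncation (illustrative): keep the first `dim`
    components, then L2-normalize so cosine similarity on the
    shortened vectors remains meaningful."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```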

Maintenance & Community

No specific community channels or detailed maintenance information are provided. The project appears research-driven, with authors listed in the citation.

Licensing & Compatibility

The README does not specify the software license. This lack of clarity is a significant adoption blocker, leaving terms for commercial use or integration with closed-source projects undefined.

Limitations & Caveats

Detailed Transformers usage examples for the Reranker are marked "Coming soon." Specific hardware requirements are not stated, and benchmarks beyond the provided tables are not elaborated. The absence of a specified license is a critical caveat.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
2
Issues (30d)
12
Star History
496 stars in the last 3 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
470
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 2 years ago
Starred by John Resig (Author of jQuery; Chief Software Architect at Khan Academy), Chenlin Meng (Cofounder of Pika), and 9 more.

clip-retrieval by rom1504

0.1%
3k
CLIP retrieval system for semantic search
Created 4 years ago
Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce

0.1%
11k
Library for language-vision AI research
Created 3 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

RAG-Anything by HKUDS

1.8%
12k
All-in-one multimodal RAG system
Created 7 months ago
Updated 5 days ago