Vision-language model code for document retrieval research
This repository provides the code for training and running inference with ColVision models, such as ColPali, ColQwen2, and ColSmol, which are designed for efficient document retrieval by leveraging Vision Language Models (VLMs). It targets researchers and developers working on multimodal information retrieval, offering a unified approach that considers both textual and visual document content, thereby eliminating the need for separate OCR and layout analysis pipelines.
How It Works
ColPali builds multi-vector embeddings from a document's visual features by feeding the output patch embeddings of a VLM (such as PaliGemma-3B) through a linear projection. Retrieval then follows the ColBERT late-interaction scheme: each query token embedding is matched against all document patch embeddings, and the per-token maximum similarities are summed to score the document. This lets a single model capture visual elements such as layout and charts alongside text, simplifying the retrieval pipeline.
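To make the scoring concrete, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring over multi-vector embeddings; the function name, token counts, and toy inputs are illustrative, not the library's internal implementation.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score between one query and one document.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (num_doc_patches, dim)  L2-normalized document patch embeddings
    """
    # Similarity of every query token against every document patch.
    sim = query_emb @ doc_emb.T                 # (num_query_tokens, num_doc_patches)
    # For each query token, keep its best-matching patch, then sum over tokens.
    return sim.max(dim=1).values.sum()

# Toy example with random embeddings; 128 is the projection size reported in the
# ColPali paper, and 1024 patches stands in for a single page image.
query = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
doc = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)
print(maxsim_score(query, doc))
```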
Quick Start & Requirements
Install the package from PyPI with pip install colpali-engine, or install from source with pip install git+https://github.com/illuin-tech/colpali.
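The snippet below is a minimal retrieval sketch following the usage pattern documented for colpali-engine; the checkpoint name vidore/colpali-v1.2, the blank toy image, and the query text are assumptions chosen for illustration.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

# Load a ColPali checkpoint and its processor (checkpoint name assumed for illustration).
model = ColPali.from_pretrained(
    "vidore/colpali-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Toy inputs: one blank page image and one query.
images = [Image.new("RGB", (448, 448), color="white")]
queries = ["What revenue is reported for 2022?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward passes produce multi-vector embeddings (one vector per patch / query token).
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scores with shape (num_queries, num_images).
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```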
Highlighted Details
Training scripts support accelerate and SLURM cluster configurations.

Maintenance & Community
The project is actively maintained by illuin-tech. A vibrant community has developed around ColPali, with numerous libraries and tutorials integrating with vector databases like Vespa, Qdrant, Elasticsearch, and Weaviate. Community resources include extensive cookbooks and notebooks.
Licensing & Compatibility
Models based on Gemma are licensed under the Gemma license. Models based on Qwen and SmolVLM are under Apache 2.0. Compatibility for commercial use depends on the base model's license.
Limitations & Caveats
The README notes potential issues with PyTorch 2.6.0 on Mac MPS devices for ColQwen models, requiring a PyTorch downgrade. Reproducing the paper's results requires checking out tag v0.1.1 or installing colpali-engine==0.1.1.