colpali by illuin-tech

Vision-language model code for document retrieval research

created 1 year ago
2,094 stars

Top 21.9% on sourcepulse

Project Summary

This repository provides the code for training and running inference with ColVision models (ColPali, ColQwen2, and ColSmol), which use Vision Language Models (VLMs) for efficient document retrieval. It targets researchers and developers working on multimodal information retrieval. Because the models account for both the textual and the visual content of a document, they remove the need for separate OCR and layout-analysis pipelines.

How It Works

ColPali builds a multi-vector embedding from a document's visual features: the patch-level outputs of a VLM (such as PaliGemma-3B) are passed through a linear projection, yielding one vector per image patch. Queries are embedded as one vector per token, and relevance is scored with ColBERT-style late interaction: each query vector is matched against its most similar document vector, and those maxima are summed. A single model therefore processes visual elements like layout and charts alongside text, which simplifies the retrieval pipeline.
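The ColBERT-style scoring mentioned above can be illustrated with a minimal sketch in plain Python. It assumes embeddings are simple lists of floats and uses a dot product as the similarity; the names `maxsim_score` and `rank_pages` and the toy vectors are hypothetical and are not taken from the colpali-engine API.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector, take the best dot
    product against any page vector, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector], pages: List[List[Vector]]
) -> Tuple[List[int], List[float]]:
    """Return page indices sorted by descending score, plus the raw scores."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy data (made up): 2 query-token vectors, two pages with 3 and 2 patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.2], [0.3, 0.1]]

order, scores = rank_pages(query, [page_a, page_b])
print(order, scores)
```

In this toy case page_a ranks first because each query vector finds a closely aligned patch vector on it. A real system would run the same computation as batched tensor operations rather than Python loops.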

Quick Start & Requirements

  • Install from PyPI with pip install colpali-engine, or from source with pip install git+https://github.com/illuin-tech/colpali.
  • Requires Python >=3.9 and a recent PyTorch version.
  • For ColQwen models on Mac with MPS, downgrade PyTorch to 2.5.1 if using torch 2.6.0.
  • GPU recommended for inference and required for training.
  • Official Docs: ColPali Paper, ViDoRe Leaderboard, Demo

Highlighted Details

  • Achieves state-of-the-art results on the ViDoRe leaderboard, with ColQwen2.5-v0.2 reaching 89.4.
  • Offers interpretability features to visualize salient image patches relevant to query terms via similarity maps.
  • Implements token pooling (e.g., HierarchicalTokenPooler) to reduce embedding sequence length by up to 66.7% while retaining 97.8% performance.
  • Supports training with accelerate and SLURM cluster configurations.
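To make the token-pooling figure concrete: keeping one pooled vector for every three original vectors shortens the sequence by 1 - 1/3, or about 66.7%. The sketch below shows the general idea with a small greedy agglomerative clustering in plain Python, merging the two most cosine-similar clusters until the target count is reached. It is an illustration under assumptions, not the library's HierarchicalTokenPooler; the function `pool_tokens`, its `pool_factor` parameter, and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def pool_tokens(embeddings: List[Vector], pool_factor: int) -> List[Vector]:
    """Greedily merge the most similar clusters until roughly
    len(embeddings) / pool_factor remain, then return each cluster's mean."""
    target = max(1, math.ceil(len(embeddings) / pool_factor))
    clusters = [[list(v)] for v in embeddings]
    while len(clusters) > target:
        best = None  # (similarity, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(mean(clusters[i]), mean(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return [mean(c) for c in clusters]


# Toy data (made up): six vectors forming two obvious groups.
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],
]
pooled = pool_tokens(embeddings, pool_factor=3)
reduction = 1 - len(pooled) / len(embeddings)
print(len(pooled), round(reduction * 100, 1))
```

The nested loop is quadratic per merge, which is fine for a toy example but not for real page embeddings with many patch vectors; a production pooler would rely on an optimized clustering routine.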

Maintenance & Community

The project is actively maintained by illuin-tech. A vibrant community has developed around ColPali, with numerous libraries and tutorials integrating with vector databases like Vespa, Qdrant, Elasticsearch, and Weaviate. Community resources include extensive cookbooks and notebooks.

Licensing & Compatibility

Models based on Gemma are licensed under the Gemma license. Models based on Qwen and SmolVLM are under Apache 2.0. Compatibility for commercial use depends on the base model's license.

Limitations & Caveats

The README notes issues when running ColQwen models with PyTorch 2.6.0 on Mac MPS devices; downgrading to PyTorch 2.5.1 is the stated workaround. Reproducing the paper's results requires checking out tag v0.1.1 or installing colpali-engine==0.1.1.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 14
  • Issues (30d): 7
  • Star history: 319 stars in the last 90 days
