colpali by illuin-tech

Vision-language model code for document retrieval research

Created 1 year ago
2,439 stars

Top 18.6% on SourcePulse

Project Summary

This repository contains the code for training ColVision models, such as ColPali, ColQwen2, and ColSmol, and for running inference with them. These models use Vision Language Models (VLMs) for efficient document retrieval. The project targets researchers and developers working on multimodal information retrieval: because a single model embeds both the textual and the visual content of a document, it removes the need for separate OCR and layout analysis pipelines.

How It Works

ColPali builds multi-vector embeddings directly from page images: a VLM (such as PaliGemma-3B) encodes the page into patch embeddings, and a linear projection maps each patch into the retrieval space. Following ColBERT, relevance is scored by late interaction: each query token embedding is matched to its most similar document patch embedding, and these maxima are summed. A single model thus handles visual elements such as layout and charts alongside text, which simplifies the retrieval pipeline.
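The ColBERT-style scoring step can be illustrated with a minimal pure-Python sketch. It is not the repository's implementation: the function names (dot, maxsim_score, rank_pages) and the toy 2-dimensional vectors are assumptions made for illustration, and real embeddings would come from the model.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query: List[Vector], doc: List[Vector]) -> float:
    """Late-interaction score: for each query token embedding, take the
    best-matching document patch embedding, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def rank_pages(query: List[Vector], pages: List[List[Vector]]) -> List[int]:
    """Return page indices sorted from most to least relevant."""
    scores = [maxsim_score(query, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (hypothetical): two query token embeddings, two pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # three patch embeddings
page_b = [[0.1, 0.1], [0.3, 0.2]]              # two patch embeddings

score_a = maxsim_score(query, page_a)  # 0.9 + 0.8, about 1.7
score_b = maxsim_score(query, page_b)  # 0.3 + 0.2, about 0.5
ranking = rank_pages(query, [page_a, page_b])
```

Because each page keeps one vector per patch, the score rewards a page whenever any of its patches matches a query token well, which is what lets a chart or table region contribute to relevance.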

Quick Start & Requirements

  • Install via pip: pip install colpali-engine, or install from source with pip install git+https://github.com/illuin-tech/colpali.
  • Python >=3.9, recent PyTorch versions.
  • For ColQwen models on Mac with MPS, downgrade PyTorch to 2.5.1 if using torch 2.6.0.
  • GPU recommended for inference and required for training.
  • Official Docs: ColPali Paper, ViDoRe Leaderboard, Demo

Highlighted Details

  • Achieves state-of-the-art results on the ViDoRe leaderboard, with ColQwen2.5-v0.2 reaching a score of 89.4.
  • Offers interpretability features to visualize salient image patches relevant to query terms via similarity maps.
  • Implements token pooling (e.g., HierarchicalTokenPooler) to reduce embedding sequence length by up to 66.7% while retaining 97.8% of the original retrieval performance.
  • Supports training with accelerate and SLURM cluster configurations.
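To make the token pooling idea concrete, here is a minimal pure-Python sketch that greedily merges the most similar embeddings and averages each group. It is a simplified stand-in, not the library's HierarchicalTokenPooler: the function names and the toy vectors are hypothetical, and the pool factor of 3 is an assumption chosen because keeping one vector in three gives a reduction of 1 - 1/3, about 66.7%.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pool_tokens(embeddings: List[Vector], pool_factor: int) -> List[Vector]:
    """Greedy agglomerative pooling (simplified, hypothetical): repeatedly
    merge the two clusters whose centroids are most similar until roughly
    len(embeddings) / pool_factor clusters remain, then return centroids."""
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")
    target = max(1, len(embeddings) // pool_factor)
    clusters = [[v] for v in embeddings]
    while len(clusters) > target:
        best_pair, best_sim = (0, 1), -2.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(mean(clusters[i]), mean(clusters[j]))
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return [mean(c) for c in clusters]


# Toy data (hypothetical): six patch embeddings forming two obvious groups.
patches = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
           [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
pooled = pool_tokens(patches, pool_factor=3)
reduction = 1 - len(pooled) / len(patches)  # 6 vectors become 2
```

This brute-force loop is only suitable for small inputs; it is meant to show why fewer, averaged vectors can still stand in for groups of similar patches, not to be efficient.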

Maintenance & Community

The project is actively maintained by illuin-tech. A vibrant community has developed around ColPali, with numerous libraries and tutorials integrating with vector databases like Vespa, Qdrant, Elasticsearch, and Weaviate. Community resources include extensive cookbooks and notebooks.

Licensing & Compatibility

Models based on Gemma are licensed under the Gemma license. Models based on Qwen and SmolVLM are under Apache 2.0. Compatibility for commercial use depends on the base model's license.

Limitations & Caveats

The README notes potential issues with PyTorch 2.6.0 on Mac MPS devices for ColQwen models, requiring a downgrade. Reproducing paper results requires checking out tag v0.1.1 or installing colpali-engine==0.1.1.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 14
  • Issues (30d): 8
  • Star history: 65 stars in the last 30 days
