ColBERT by stanford-futuredata

Neural search for fast, accurate retrieval over large text collections

created 5 years ago
3,519 stars

Top 14.0% on sourcepulse

View on GitHub
Project Summary

ColBERT is a state-of-the-art neural retrieval model designed for fast and accurate semantic search over large text collections. It targets researchers and engineers building information retrieval systems, offering BERT-based search capabilities that operate in tens of milliseconds. The core benefit is achieving high relevance while maintaining scalability.

How It Works

ColBERT employs a "contextualized late interaction" approach. It encodes each passage into a matrix of token-level embeddings. During search, queries are also embedded into matrices. Relevance is then calculated efficiently using scalable vector-similarity (MaxSim) operators that capture fine-grained interactions between query and passage tokens. This method surpasses single-vector models in quality and scales effectively.
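
As a rough illustration, the MaxSim score can be written in a few lines of PyTorch. The tensor names and dimensions below are illustrative only and do not reflect ColBERT's internal implementation details (compression, candidate generation, batching, etc.):

    import torch

    def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
        """Late-interaction (MaxSim) relevance score.

        query_emb:   [num_query_tokens, dim]   L2-normalized token embeddings
        passage_emb: [num_passage_tokens, dim] L2-normalized token embeddings
        """
        # Cosine similarity between every query token and every passage token.
        sim = query_emb @ passage_emb.T        # [num_query_tokens, num_passage_tokens]
        # For each query token, keep its best-matching passage token,
        # then sum these per-token maxima to get the passage score.
        return sim.max(dim=1).values.sum()

    # Toy example with random, normalized embeddings.
    q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
    p = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
    print(maxsim_score(q, p))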

Quick Start & Requirements

  • Install via pip: pip install colbert-ai[torch,faiss-gpu] (conda recommended for FAISS/PyTorch).
  • Prerequisites: Python 3.7+, PyTorch 1.9+, Hugging Face Transformers. GPU is required for training and indexing; CPU inference is supported.
  • Setup: Conda environment creation is recommended.
  • Resources: Indexing 10,000 passages on a free Colab T4 GPU takes ~6 minutes.
  • Docs: API Usage Notebook (a minimal indexing/search sketch follows below).
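
For orientation, here is a minimal indexing-and-search sketch against the colbert-ai Python API. The checkpoint name, experiment name, paths, and configuration values are placeholders, and exact option names may vary between releases:

    from colbert import Indexer, Searcher
    from colbert.infra import Run, RunConfig, ColBERTConfig

    if __name__ == "__main__":
        # Index a TSV collection of passages (placeholder paths throughout).
        with Run().context(RunConfig(nranks=1, experiment="demo")):
            config = ColBERTConfig(nbits=2)  # 2-bit residual compression
            indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
            indexer.index(name="demo.index", collection="/path/to/collection.tsv")

        # Search the index with a free-text query.
        with Run().context(RunConfig(nranks=1, experiment="demo")):
            searcher = Searcher(index="demo.index")
            passage_ids, ranks, scores = searcher.search("what is late interaction?", k=10)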

Highlighted Details

  • State-of-the-art performance across multiple SIGIR, TACL, NeurIPS, NAACL, CIKM, ACL, and EMNLP papers.
  • Supports training custom ColBERT models and provides pre-trained checkpoints (e.g., ColBERTv2.0 trained on MS MARCO).
  • Offers a lightweight server for API-based search (see the sketch after this list).
  • Integrates with frameworks like DSPy.
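
As a rough sketch, a client could query such a server over HTTP. The port, route, parameter names, and response fields below are assumptions for illustration and may differ from the bundled server's actual interface:

    import requests

    # Hypothetical local endpoint; the actual route, port, and parameters
    # depend on how the search server is configured and launched.
    resp = requests.get(
        "http://localhost:8893/api/search",
        params={"query": "what is late interaction?", "k": 10},
    )
    for hit in resp.json().get("topk", []):
        print(hit)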

Maintenance & Community

  • Actively developed by Stanford University researchers.
  • The README highlights integration with the growing RAGatouille library.
  • Links to DSPy framework provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing terms.

Limitations & Caveats

  • Indexing and training require a GPU; only search (inference) is supported on CPU.
  • Some branches are marked as deprecated, indicating potential shifts in the project's focus or implementation.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

  • 176 stars in the last 90 days

Explore Similar Projects

Starred by Jason Liu (author of Instructor) and Ross Taylor (cofounder of General Reasoning; creator of Papers with Code).

Search-R1 by PeterGriffinJin

  • Top 1.3% on sourcepulse · 3k stars
  • RL framework for training LLMs to use search engines
  • created 5 months ago; updated 3 weeks ago