ColBERT by stanford-futuredata

Neural search for fast, accurate retrieval over large text collections

created 5 years ago
3,519 stars

Top 14.0% on sourcepulse

View on GitHub
Project Summary

ColBERT is a state-of-the-art neural retrieval model designed for fast and accurate semantic search over large text collections. It targets researchers and engineers building information retrieval systems, offering BERT-based search capabilities that operate in tens of milliseconds. The core benefit is achieving high relevance while maintaining scalability.

How It Works

ColBERT employs a "contextualized late interaction" approach. It encodes each passage into a matrix of token-level embeddings. During search, queries are also embedded into matrices. Relevance is then calculated efficiently using scalable vector-similarity (MaxSim) operators that capture fine-grained interactions between query and passage tokens. This method surpasses single-vector models in quality and scales effectively.
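
As a rough illustration, the MaxSim score can be written in a few lines of PyTorch. The tensor names and dimensions below are illustrative only and do not reflect ColBERT's internal implementation details (compression, candidate generation, batching, etc.):

    import torch

    def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
        """Late-interaction (MaxSim) relevance score.

        query_emb:   [num_query_tokens, dim]   L2-normalized token embeddings
        passage_emb: [num_passage_tokens, dim] L2-normalized token embeddings
        """
        # Cosine similarity between every query token and every passage token.
        sim = query_emb @ passage_emb.T        # [num_query_tokens, num_passage_tokens]
        # For each query token, keep its best-matching passage token,
        # then sum these per-token maxima to get the passage score.
        return sim.max(dim=1).values.sum()

    # Toy example with random, normalized embeddings.
    q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
    p = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
    print(maxsim_score(q, p))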

Quick Start & Requirements

  • Install via pip: pip install colbert-ai[torch,faiss-gpu] (conda recommended for FAISS/PyTorch).
  • Prerequisites: Python 3.7+, PyTorch 1.9+, Hugging Face Transformers. GPU is required for training and indexing; CPU inference is supported.
  • Setup: Conda environment creation is recommended.
  • Resources: Indexing 10,000 passages on a free Colab T4 GPU takes ~6 minutes.
  • Docs: API Usage Notebook (a minimal indexing/search sketch follows below).
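
For orientation, here is a minimal indexing-and-search sketch against the colbert-ai Python API. The checkpoint name, experiment name, paths, and configuration values are placeholders, and exact option names may vary between releases:

    from colbert import Indexer, Searcher
    from colbert.infra import Run, RunConfig, ColBERTConfig

    if __name__ == "__main__":
        # Index a TSV collection of passages (placeholder paths throughout).
        with Run().context(RunConfig(nranks=1, experiment="demo")):
            config = ColBERTConfig(nbits=2)  # 2-bit residual compression
            indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
            indexer.index(name="demo.index", collection="/path/to/collection.tsv")

        # Search the index with a free-text query.
        with Run().context(RunConfig(nranks=1, experiment="demo")):
            searcher = Searcher(index="demo.index")
            passage_ids, ranks, scores = searcher.search("what is late interaction?", k=10)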

Highlighted Details

  • State-of-the-art performance across multiple SIGIR, TACL, NeurIPS, NAACL, CIKM, ACL, and EMNLP papers.
  • Supports training custom ColBERT models and provides pre-trained checkpoints (e.g., ColBERTv2.0 trained on MS MARCO).
  • Offers a lightweight server for API-based search (see the sketch after this list).
  • Integrates with frameworks like DSPy.
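
As a rough sketch, a client could query such a server over HTTP. The port, route, parameter names, and response fields below are assumptions for illustration and may differ from the bundled server's actual interface:

    import requests

    # Hypothetical local endpoint; the actual route, port, and parameters
    # depend on how the search server is configured and launched.
    resp = requests.get(
        "http://localhost:8893/api/search",
        params={"query": "what is late interaction?", "k": 10},
    )
    for hit in resp.json().get("topk", []):
        print(hit)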

Maintenance & Community

  • Actively developed by Stanford University researchers.
  • The README highlights integration with the growing RAGatouille library.
  • Links to DSPy framework provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing terms.

Limitations & Caveats

  • Indexing and training require a GPU; only search (inference) is supported on CPU.
  • Some branches are marked as deprecated, indicating potential shifts in the project's focus or implementation.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

  • 176 stars in the last 90 days

Explore Similar Projects

Starred by Jason Liu (author of Instructor) and Ross Taylor (cofounder of General Reasoning; creator of Papers with Code).

Search-R1 by PeterGriffinJin

  • Top 1.3% on sourcepulse · 3k stars
  • RL framework for training LLMs to use search engines
  • created 5 months ago; updated 3 weeks ago