contriever by facebookresearch

Unsupervised dense information retrieval via contrastive learning

created 3 years ago
749 stars

Top 47.3% on sourcepulse

Project Summary

Contriever is an open-source library for unsupervised dense information retrieval, offering pre-trained models and code for training and evaluation. It targets researchers and practitioners in NLP and information retrieval, enabling competitive retrieval performance without supervised data.

How It Works

Contriever pre-trains a dense retriever with a contrastive learning framework. A simple contrastive loss trains the encoder to map text to dense embeddings whose relevance can be compared efficiently via dot products. Despite using no supervision, this approach is competitive with strong term-matching baselines such as BM25, particularly on recall-oriented metrics.
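
The objective described above can be sketched as an InfoNCE-style contrastive loss in PyTorch, where each query's positive passage sits at the same batch index and the other passages act as in-batch negatives. This is a minimal illustration, not the library's exact API; the function name and temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries, keys, temperature=0.05):
    """InfoNCE contrastive loss with in-batch negatives (illustrative)."""
    # Pairwise dot-product scores between every query and every key.
    scores = queries @ keys.T / temperature      # shape (B, B)
    # The positive key for query i sits at index i (the diagonal).
    labels = torch.arange(queries.size(0))
    return F.cross_entropy(scores, labels)

# Toy batch: 4 normalized query embeddings and their 4 positive keys.
q = F.normalize(torch.randn(4, 768), dim=-1)
k = F.normalize(torch.randn(4, 768), dim=-1)
loss = info_nce_loss(q, k)
```

Minimizing this loss pulls each query toward its positive passage and pushes it away from the other passages in the batch, which is what makes plain dot products meaningful as relevance scores at retrieval time.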

Quick Start & Requirements

  • Load a pre-trained model (the Contriever class lives in the repository's src/ directory; the weights are hosted on the Hugging Face Hub):
    from src.contriever import Contriever
    from transformers import AutoTokenizer
    contriever = Contriever.from_pretrained("facebook/contriever")
    tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
    
  • Requires PyTorch and HuggingFace transformers.
  • Pre-trained models are available for English (contriever, contriever-msmarco) and multilingual (mcontriever, mcontriever-msmarco).
  • See HuggingFace Hub for model details.
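
Contriever produces one embedding per text by mean-pooling the encoder's token embeddings while ignoring padding positions. The pooling step can be sketched standalone on dummy tensors; in practice token_embeddings would be the model's last hidden state for a tokenized batch.

```python
import torch

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings over the sequence, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()      # (B, T, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid divide-by-zero
    return summed / counts

# Dummy "model output": 2 sequences, 5 tokens each, hidden size 8;
# the second sequence has 2 padding tokens (mask = 0).
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])
embeddings = mean_pooling(hidden, mask)              # shape (2, 8)
```

The resulting fixed-size vectors are what get compared by dot product between a query and candidate passages.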

Highlighted Details

  • Competitive with BM25 on the BEIR benchmark without supervision.
  • Achieves strong Recall@100 after fine-tuning on MS MARCO.
  • Offers multilingual and cross-lingual retrieval capabilities with mContriever.
  • Provides pre-computed passage embeddings for faster evaluation.
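
Given pre-computed passage embeddings, retrieval reduces to dot-product scoring plus a top-k selection. A toy sketch, with random vectors standing in for real Contriever embeddings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def top_k_passages(query_emb, passage_embs, k=3):
    """Rank passages by dot-product similarity to a query embedding."""
    scores = passage_embs @ query_emb        # (N,) similarity scores
    top = torch.topk(scores, k)              # sorted highest-first
    return top.indices.tolist(), top.values.tolist()

# 100 pre-computed (normalized) passage embeddings, hidden size 128.
passages = F.normalize(torch.randn(100, 128), dim=-1)
# A query that is a slightly perturbed copy of passage 42,
# so it should rank that passage first.
query = F.normalize(passages[42] + 0.01 * torch.randn(128), dim=-1)
idx, vals = top_k_passages(query, passages)
```

At real corpus scale the same dot-product scoring is typically delegated to an approximate nearest-neighbor index rather than a dense matrix product.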

Maintenance & Community

  • Developed by Facebook AI Research.
  • Codebase appears actively maintained.
  • Citation details provided for academic use.

Licensing & Compatibility

  • The README does not state the license; consult the LICENSE file in the repository.
  • Compatibility with commercial use and closed-source linking depends on the actual license terms; verify them before relying on either.

Limitations & Caveats

  • Training requires significant computational resources (e.g., 32 GPUs).
  • Evaluation scripts assume specific data formats and download locations.
  • Touché-2020 results may differ from published numbers due to dataset updates.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 20 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Didier Lopes (founder of OpenBB), and 11 more.

  • sentence-transformers by UKPLab — Framework for text embeddings, retrieval, and reranking. Top 0.2%, 17k stars, created 6 years ago, updated 3 days ago.