contriever by facebookresearch

Unsupervised dense information retrieval via contrastive learning

Created 3 years ago
756 stars

Top 46.0% on SourcePulse

Project Summary

Contriever is an open-source library for unsupervised dense information retrieval, offering pre-trained models and code for training and evaluation. It targets researchers and practitioners in NLP and information retrieval, enabling competitive retrieval performance without supervised data.

How It Works

Contriever pre-trains retrieval models with a contrastive learning framework. A simple contrastive loss teaches the model to map text to dense embeddings in which relevance can be scored efficiently as a dot product between query and passage vectors. Despite using no labeled data, this unsupervised approach is competitive with term-matching baselines such as BM25, with particularly strong recall.
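
For intuition, here is a minimal sketch of an InfoNCE-style contrastive objective of the kind described above, in plain PyTorch. The names (query_emb, key_emb) and the temperature value are illustrative assumptions, not the repository's actual training code:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(query_emb, key_emb, temperature=0.05):
        # query_emb, key_emb: (batch, dim) tensors; row i of key_emb is the
        # positive for row i of query_emb, every other row serves as a negative.
        scores = query_emb @ key_emb.t() / temperature  # pairwise dot-product similarities
        labels = torch.arange(query_emb.size(0), device=query_emb.device)
        # Cross-entropy pulls each query toward its positive and away from the negatives.
        return F.cross_entropy(scores, labels)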

Quick Start & Requirements

  • Load the pre-trained model via HuggingFace transformers (the Contriever wrapper class ships in this repository's src/ directory; see the usage sketch after this list):
    # Run from a clone of the repository so that src/ is importable.
    from src.contriever import Contriever
    from transformers import AutoTokenizer

    contriever = Contriever.from_pretrained("facebook/contriever")
    tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")

  • Requires PyTorch and HuggingFace transformers.
  • Pre-trained models are available for English (contriever, contriever-msmarco) and multilingual (mcontriever, mcontriever-msmarco).
  • See HuggingFace Hub for model details.
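
A minimal usage sketch following the snippet above; it assumes, as shown in the project README, that the Contriever forward pass returns one mean-pooled embedding per input, and that relevance is scored with a plain dot product:

    sentences = [
        "Where was Marie Curie born?",
        "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    ]

    # Tokenize and embed; the model returns one dense vector per sentence.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    embeddings = contriever(**inputs)

    # Relevance is the dot product between query and passage embeddings.
    score = embeddings[0] @ embeddings[1]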

Highlighted Details

  • Competitive with BM25 on the BEIR benchmark without any supervision.
  • Strong recall@100 after fine-tuning on MS MARCO.
  • Multilingual and cross-lingual retrieval via mContriever (loaded the same way; see the example after this list).
  • Pre-computed passage embeddings available for faster evaluation.
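
The multilingual models load the same way; the model IDs below are the variants listed above, published under the facebook namespace on the HuggingFace Hub:

    # Multilingual variant, fine-tuned on MS MARCO.
    mcontriever = Contriever.from_pretrained("facebook/mcontriever-msmarco")
    mtokenizer = AutoTokenizer.from_pretrained("facebook/mcontriever-msmarco")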

Maintenance & Community

  • Developed by Facebook AI Research.
  • Maintenance appears inactive: the last commit was roughly two years ago, with no pull requests or issues in the last 30 days (see Health Check below).
  • Citation details provided for academic use.

Licensing & Compatibility

  • The README does not state the license; consult the repository's LICENSE file directly.
  • Commercial use and closed-source linking should not be assumed until the license terms are verified.

Limitations & Caveats

  • Training requires significant computational resources (e.g., 32 GPUs).
  • Evaluation scripts assume specific data formats and download locations.
  • Results on Touché-2020 may differ from the paper because the dataset was updated in BEIR.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
