contriever by facebookresearch

Unsupervised dense information retrieval via contrastive learning

Created 3 years ago
756 stars

Top 46.0% on SourcePulse

Project Summary

Contriever is an open-source library for unsupervised dense information retrieval, offering pre-trained models and code for training and evaluation. It targets researchers and practitioners in NLP and information retrieval, enabling competitive retrieval performance without supervised data.

How It Works

Contriever pre-trains retrieval models with a contrastive learning framework. A simple contrastive loss teaches the model to map text to dense embeddings in which relevance can be scored efficiently as a dot product between query and passage vectors. Despite using no labeled data, this unsupervised approach is competitive with term-matching baselines such as BM25, with particularly strong recall.
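
For intuition, here is a minimal sketch of an InfoNCE-style contrastive objective of the kind described above, in plain PyTorch. The names (query_emb, key_emb) and the temperature value are illustrative assumptions, not the repository's actual training code:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(query_emb, key_emb, temperature=0.05):
        # query_emb, key_emb: (batch, dim) tensors; row i of key_emb is the
        # positive for row i of query_emb, every other row serves as a negative.
        scores = query_emb @ key_emb.t() / temperature  # pairwise dot-product similarities
        labels = torch.arange(query_emb.size(0), device=query_emb.device)
        # Cross-entropy pulls each query toward its positive and away from the negatives.
        return F.cross_entropy(scores, labels)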

Quick Start & Requirements

  • Load the pre-trained model via HuggingFace transformers (the Contriever wrapper class ships in this repository's src/ directory; see the usage sketch after this list):
    # Run from a clone of the repository so that src/ is importable.
    from src.contriever import Contriever
    from transformers import AutoTokenizer

    contriever = Contriever.from_pretrained("facebook/contriever")
    tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")

  • Requires PyTorch and HuggingFace transformers.
  • Pre-trained models are available for English (contriever, contriever-msmarco) and multilingual (mcontriever, mcontriever-msmarco).
  • See HuggingFace Hub for model details.
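
A minimal usage sketch following the snippet above; it assumes, as shown in the project README, that the Contriever forward pass returns one mean-pooled embedding per input, and that relevance is scored with a plain dot product:

    sentences = [
        "Where was Marie Curie born?",
        "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    ]

    # Tokenize and embed; the model returns one dense vector per sentence.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    embeddings = contriever(**inputs)

    # Relevance is the dot product between query and passage embeddings.
    score = embeddings[0] @ embeddings[1]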

Highlighted Details

  • Competitive with BM25 on the BEIR benchmark without any supervision.
  • Strong recall@100 after fine-tuning on MS MARCO.
  • Multilingual and cross-lingual retrieval via mContriever (loaded the same way; see the example after this list).
  • Pre-computed passage embeddings available for faster evaluation.
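
The multilingual models load the same way; the model IDs below are the variants listed above, published under the facebook namespace on the HuggingFace Hub:

    # Multilingual variant, fine-tuned on MS MARCO.
    mcontriever = Contriever.from_pretrained("facebook/mcontriever-msmarco")
    mtokenizer = AutoTokenizer.from_pretrained("facebook/mcontriever-msmarco")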

Maintenance & Community

  • Developed by Facebook AI Research.
  • Maintenance appears inactive: the last commit was roughly two years ago, with no pull requests or issues in the last 30 days (see Health Check below).
  • Citation details provided for academic use.

Licensing & Compatibility

  • The README does not state the license; consult the repository's LICENSE file directly.
  • Commercial use and closed-source linking should not be assumed until the license terms are verified.

Limitations & Caveats

  • Training requires significant computational resources (e.g., 32 GPUs).
  • Evaluation scripts assume specific data formats and download locations.
  • Results on Touché-2020 may differ from the paper because the dataset was updated in BEIR.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
