dpr-scale by facebookresearch

Scalable training for dense retrieval models

Created 3 years ago
298 stars

Top 89.2% on SourcePulse

View on GitHub
Project Summary

This repository provides a scalable implementation for training dense retrieval models, building on research from multiple papers, including work on domain-matched pre-training and efficient multi-vector retrieval. It is aimed at researchers and engineers working on large-scale information retrieval systems who need to train and deploy efficient dense retrievers.

How It Works

The project is built on PyTorch Lightning for distributed training and uses Hydra for configuration and hyperparameter sweeps. It supports several data formats, including a lightweight JSONL format for large corpora, and incorporates techniques such as domain-matched pre-training and diverse data augmentation (DRAGON, DRAMA) to improve retriever generalization and performance.
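
As background, the core objective such bi-encoder retrievers optimize is a contrastive loss over in-batch negatives. Below is a minimal, self-contained sketch of that idea in plain PyTorch and Hugging Face Transformers; the model names and the encode helper are illustrative, not this repo's actual API:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative bi-encoder (not this repo's API): a question encoder and a
# context encoder map text to dense vectors scored by inner product.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
q_encoder = AutoModel.from_pretrained("bert-base-uncased")
ctx_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] vectors

questions = ["who wrote hamlet?", "what is the capital of france?"]
passages = [
    "Hamlet is a tragedy written by William Shakespeare.",
    "Paris is the capital and largest city of France.",
]

q = encode(q_encoder, questions)   # (batch, hidden)
p = encode(ctx_encoder, passages)  # (batch, hidden)
scores = q @ p.T                   # (batch, batch) similarity matrix

# In-batch negatives: the passage at the same index is the positive;
# every other passage in the batch serves as a negative.
loss = F.cross_entropy(scores, torch.arange(len(questions)))
loss.backward()
```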

Quick Start & Requirements

  • Run: PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py (the Hydra-based training entry point)
  • Prerequisites: Python, PyTorch, Hugging Face Transformers, Hydra. Distributed setups are configured via Hydra overrides (e.g., trainer.gpus=1, trainer.num_nodes=2).
  • Data Format: JSONL with a question, positive contexts, and optional hard negatives; a lightweight variant references passages by docidx for large corpora (see the sketch after this list).
  • Resources: Training examples suggest multi-node, multi-GPU setups for large datasets and models.
  • Links: Official Docs, Pretrained Models, Datasets
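
For orientation, here is a hedged sketch of what one training line might look like; the field names (question, positive_ctxs, hard_negative_ctxs, docidx) follow the broader DPR convention and should be verified against this repo's README:

```python
import json

# Hypothetical training examples in DPR-style JSONL (one JSON object per
# line); exact field names may differ in this repo.
example = {
    "question": "who wrote hamlet?",
    "positive_ctxs": [
        {"title": "Hamlet", "text": "Hamlet is a tragedy written by William Shakespeare..."}
    ],
    "hard_negative_ctxs": [
        {"title": "Macbeth", "text": "Macbeth is a tragedy written by William Shakespeare..."}
    ],
}

# Lightweight variant for large corpora: passages are referenced by an
# integer docidx into a separate corpus file instead of being inlined.
lightweight = {"question": "who wrote hamlet?", "positive_ctxs": [{"docidx": 42}]}

with open("train.jsonl", "w") as f:
    for row in (example, lightweight):
        f.write(json.dumps(row) + "\n")
```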

Highlighted Details

  • Implements techniques from papers on domain-matched pre-training, salient phrase awareness, CITADEL, DRAGON, and DRAMA.
  • Supports training on various datasets including PAQ, Reddit, ConvAI2, DSTC7, and Ubuntu V2.
  • Provides scripts for generating embeddings, running retrieval, and evaluating performance (see the retrieval sketch after this list).
  • Offers pre-trained checkpoints for BERT and RoBERTa models on PAQ and Reddit datasets.
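
To make the retrieval step concrete, below is a minimal sketch of brute-force maximum inner-product search over precomputed embeddings; it illustrates the general pipeline rather than the repo's own scripts, which at scale typically shard the corpus or use an ANN index such as FAISS:

```python
import torch

# Stand-in corpus and query embeddings; in practice these come from the
# context and question encoders, respectively.
corpus = torch.randn(10_000, 768)  # (num_passages, hidden)
queries = torch.randn(4, 768)      # (num_queries, hidden)

# Score every passage by inner product with each query and keep the top-k.
scores = queries @ corpus.T                    # (num_queries, num_passages)
top_scores, top_idx = scores.topk(k=10, dim=1)
print(top_idx[0].tolist())  # passage indices retrieved for the first query
```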

Maintenance & Community

  • Developed by Facebook AI Research.
  • Links to relevant papers are provided. No explicit community links (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • License: CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
  • Compatibility: The non-commercial clause restricts use in commercial products or services.

Limitations & Caveats

The CC-BY-NC 4.0 license strictly prohibits commercial use. The README focuses on training and reproduction, with less detail on deployment or inference optimization for production environments.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 23 more.

sentence-transformers by UKPLab

Framework for text embeddings, retrieval, and reranking
Top 0.3% on SourcePulse · 18k stars · Created 6 years ago · Updated 3 days ago