dpr-scale by facebookresearch

Scalable training for dense retrieval models

Created 3 years ago
298 stars

Top 89.2% on SourcePulse

View on GitHub
Project Summary

This repository provides a scalable implementation for training dense retrieval models, building on research from multiple papers, including work on domain-matched pre-training and efficient multi-vector retrieval. It is aimed at researchers and engineers working on large-scale information retrieval systems who need to train and deploy efficient dense retrievers.

How It Works

The project is built on PyTorch Lightning for distributed training and uses Hydra for configuration and hyperparameter sweeps. It supports several data formats, including a lightweight JSONL format for large corpora, and incorporates techniques such as domain-matched pre-training and diverse data augmentation (DRAGON, DRAMA) to improve retriever generalization and performance.
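
As background, the core objective such bi-encoder retrievers optimize is a contrastive loss over in-batch negatives. Below is a minimal, self-contained sketch of that idea in plain PyTorch and Hugging Face Transformers; the model names and the encode helper are illustrative, not this repo's actual API:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative bi-encoder (not this repo's API): a question encoder and a
# context encoder map text to dense vectors scored by inner product.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
q_encoder = AutoModel.from_pretrained("bert-base-uncased")
ctx_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] vectors

questions = ["who wrote hamlet?", "what is the capital of france?"]
passages = [
    "Hamlet is a tragedy written by William Shakespeare.",
    "Paris is the capital and largest city of France.",
]

q = encode(q_encoder, questions)   # (batch, hidden)
p = encode(ctx_encoder, passages)  # (batch, hidden)
scores = q @ p.T                   # (batch, batch) similarity matrix

# In-batch negatives: the passage at the same index is the positive;
# every other passage in the batch serves as a negative.
loss = F.cross_entropy(scores, torch.arange(len(questions)))
loss.backward()
```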

Quick Start & Requirements

  • Run: PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py (the Hydra-based training entry point)
  • Prerequisites: Python, PyTorch, Hugging Face Transformers, Hydra. Distributed setups are configured via Hydra overrides (e.g., trainer.gpus=1, trainer.num_nodes=2).
  • Data Format: JSONL with a question, positive contexts, and optional hard negatives; a lightweight variant references passages by docidx for large corpora (see the sketch after this list).
  • Resources: Training examples suggest multi-node, multi-GPU setups for large datasets and models.
  • Links: Official Docs, Pretrained Models, Datasets
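
For orientation, here is a hedged sketch of what one training line might look like; the field names (question, positive_ctxs, hard_negative_ctxs, docidx) follow the broader DPR convention and should be verified against this repo's README:

```python
import json

# Hypothetical training examples in DPR-style JSONL (one JSON object per
# line); exact field names may differ in this repo.
example = {
    "question": "who wrote hamlet?",
    "positive_ctxs": [
        {"title": "Hamlet", "text": "Hamlet is a tragedy written by William Shakespeare..."}
    ],
    "hard_negative_ctxs": [
        {"title": "Macbeth", "text": "Macbeth is a tragedy written by William Shakespeare..."}
    ],
}

# Lightweight variant for large corpora: passages are referenced by an
# integer docidx into a separate corpus file instead of being inlined.
lightweight = {"question": "who wrote hamlet?", "positive_ctxs": [{"docidx": 42}]}

with open("train.jsonl", "w") as f:
    for row in (example, lightweight):
        f.write(json.dumps(row) + "\n")
```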

Highlighted Details

  • Implements techniques from papers on domain-matched pre-training, salient phrase awareness, CITADEL, DRAGON, and DRAMA.
  • Supports training on various datasets including PAQ, Reddit, ConvAI2, DSTC7, and Ubuntu V2.
  • Provides scripts for generating embeddings, running retrieval, and evaluating performance (see the retrieval sketch after this list).
  • Offers pre-trained checkpoints for BERT and RoBERTa models on PAQ and Reddit datasets.
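
To make the retrieval step concrete, below is a minimal sketch of brute-force maximum inner-product search over precomputed embeddings; it illustrates the general pipeline rather than the repo's own scripts, which at scale typically shard the corpus or use an ANN index such as FAISS:

```python
import torch

# Stand-in corpus and query embeddings; in practice these come from the
# context and question encoders, respectively.
corpus = torch.randn(10_000, 768)  # (num_passages, hidden)
queries = torch.randn(4, 768)      # (num_queries, hidden)

# Score every passage by inner product with each query and keep the top-k.
scores = queries @ corpus.T                    # (num_queries, num_passages)
top_scores, top_idx = scores.topk(k=10, dim=1)
print(top_idx[0].tolist())  # passage indices retrieved for the first query
```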

Maintenance & Community

  • Developed by Facebook AI Research.
  • Links to relevant papers are provided. No explicit community links (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • License: CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
  • Compatibility: The non-commercial clause restricts use in commercial products or services.

Limitations & Caveats

The CC-BY-NC 4.0 license strictly prohibits commercial use. The README focuses on training and reproduction, with less detail on deployment or inference optimization for production environments.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 23 more.

sentence-transformers by UKPLab

Framework for text embeddings, retrieval, and reranking
Top 0.3% on SourcePulse · 18k stars · Created 6 years ago · Updated 3 days ago