dpr-scale by facebookresearch

Scalable training for dense retrieval models

Created 3 years ago · 299 stars · Top 90.0% on sourcepulse

Project Summary

This repository provides a scalable implementation for training dense retrieval models, building upon research from multiple papers including domain-matched pre-training and efficient multi-vector retrieval. It is targeted at researchers and engineers working on large-scale information retrieval systems who need to train and deploy efficient dense retrievers.

How It Works

The project uses PyTorch Lightning for distributed training and Hydra for configuration and hyperparameter sweeps. It supports various data formats, including a lightweight JSONL format for large corpora, and incorporates techniques such as domain-matched pre-training and diverse augmentation strategies (DRAGON, DRAMA) to improve retriever generalization and performance.
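
As a minimal sketch of what a launch looks like, the command below combines the main entry point with the Hydra overrides mentioned in the quick start; any further settings would be passed the same way, and the exact keys should be checked against the repo's config files.

    # Minimal training launch (sketch). trainer.gpus / trainer.num_nodes come
    # from the quick-start notes; other settings (data paths, model name,
    # batch size) would be passed as additional key=value Hydra overrides.
    PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py \
        trainer.gpus=1 \
        trainer.num_nodes=2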

Quick Start & Requirements

  • Run: training is launched with PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py plus Hydra overrides (see the sketch above).
  • Prerequisites: Python, PyTorch, Hugging Face Transformers, Hydra. GPU and node counts are selected through Hydra overrides (e.g., trainer.gpus=1, trainer.num_nodes=2).
  • Data Format: JSONL with a question, positive contexts, and optional hard negatives (an example line is sketched after this list). A lightweight format using docidx is available for large corpora.
  • Resources: Training examples suggest multi-node, multi-GPU setups for large datasets and models.
  • Links: Official Docs, Pretrained Models, Datasets
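
A rough sketch of one line in that JSONL training format; the field names (question, positive_ctxs, hard_negative_ctxs) follow the common DPR-style convention and are assumptions here, so verify them against the repo's data loaders.

    # Append one illustrative training example to a JSONL file.
    # Field names are assumed DPR-style conventions, not verified against dpr-scale.
    echo '{"question": "who founded the red cross", "positive_ctxs": [{"title": "Henry Dunant", "text": "Henry Dunant co-founded the Red Cross in 1863."}], "hard_negative_ctxs": [{"title": "Red Cross (disambiguation)", "text": "Red Cross may refer to several organizations."}]}' >> train.jsonl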

Highlighted Details

  • Implements techniques from papers on domain-matched pre-training, salient phrase awareness, CITADEL, DRAGON, and DRAMA.
  • Supports training on various datasets including PAQ, Reddit, ConvAI2, DSTC7, and Ubuntu V2.
  • Provides scripts for generating embeddings, running retrieval, and evaluating performance (a pipeline sketch follows this list).
  • Offers pre-trained checkpoints for BERT and RoBERTa models on PAQ and Reddit datasets.
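
A hedged sketch of that pipeline; the script names and option keys below are placeholders standing in for the repo's actual entry points, which the README documents.

    # Hypothetical end-to-end retrieval pipeline (script names and keys are
    # placeholders; consult the README for the real commands).
    # 1) Encode the passage corpus into dense vectors with a trained checkpoint.
    PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_embeddings.py \
        checkpoint_path=/path/to/model.ckpt
    # 2) Encode queries, search the index, and write ranked passages per query.
    PYTHONPATH=.:$PYTHONPATH python dpr_scale/run_retrieval.py \
        checkpoint_path=/path/to/model.ckpt
    # 3) Compute retrieval metrics such as recall@k from the ranked output
    #    using the provided evaluation script.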

Maintenance & Community

  • Developed by Facebook AI Research.
  • Links to relevant papers are provided. No explicit community links (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • License: CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
  • Compatibility: The non-commercial clause restricts use in commercial products or services.

Limitations & Caveats

The CC-BY-NC 4.0 license strictly prohibits commercial use. The README focuses on training and reproduction, with less detail on deployment or inference optimization for production environments.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
