Scalable training for dense retrieval models
This repository provides a scalable implementation for training dense retrieval models, building on research that spans domain-matched pre-training and efficient multi-vector retrieval. It is aimed at researchers and engineers working on large-scale information retrieval systems who need to train and deploy efficient dense retrievers.
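Dense retrievers in this family are commonly trained with a contrastive objective over query and passage embeddings. The snippet below is a minimal, generic sketch of such an objective (in-batch negatives); it is illustrative only and not this repository's actual training code.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Contrastive loss with in-batch negatives: the passage at index i is the
    positive for query i; every other passage in the batch acts as a negative."""
    scores = q_emb @ p_emb.T                      # [B, B] similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)

# Toy usage with random, normalized embeddings (batch of 4, dimension 8).
q = F.normalize(torch.randn(4, 8), dim=-1)
p = F.normalize(torch.randn(4, 8), dim=-1)
print(in_batch_negatives_loss(q, p))
```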
How It Works
The project is built on PyTorch Lightning for distributed training and hyperparameter optimization. It supports multiple data formats, including lightweight JSONL for large corpora, and incorporates techniques such as domain-matched pre-training and diverse augmentation strategies (DRAGON, DRAMA) to improve retriever generalization and performance.
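For instance, a JSONL corpus can be streamed line by line so that a large collection never has to be loaded fully into memory. The reader below is a minimal sketch; the file name and record schema are placeholders, not the repository's actual format.

```python
import json
from typing import Dict, Iterator

def iter_jsonl_corpus(path: str) -> Iterator[Dict]:
    """Stream one JSON record per line, keeping memory use constant
    regardless of corpus size."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: inspect the first record of a (hypothetical) corpus file.
for passage in iter_jsonl_corpus("corpus.jsonl"):
    print(passage)
    break
```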
Quick Start & Requirements
Run training with:
PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py
Compute resources are configured through Hydra overrides (e.g., trainer.gpus=1, trainer.num_nodes=2). The docidx format is used for large corpora.
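The launch can also be wrapped in a small Python script. The sketch below only reuses the command and the trainer.gpus / trainer.num_nodes overrides quoted above; any other overrides depend on your Hydra configuration and are not specified here.

```python
import os
import subprocess

# Put the repository root on PYTHONPATH, as the quick-start command does.
env = dict(os.environ)
env["PYTHONPATH"] = "." + os.pathsep + env.get("PYTHONPATH", "")

# Overrides taken from the quick-start notes; adjust to your hardware.
overrides = ["trainer.gpus=1", "trainer.num_nodes=2"]

subprocess.run(["python", "dpr_scale/main.py", *overrides], env=env, check=True)
```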
Highlighted Details
Maintenance & Community
The repository was last updated about a month ago and is currently flagged as inactive.
Licensing & Compatibility
Limitations & Caveats
The CC-BY-NC 4.0 license strictly prohibits commercial use. The README focuses on training and reproduction, with less detail on deployment or inference optimization for production environments.