Embedding training algorithm for text retrieval research
This repository provides the implementation for Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE), a novel training algorithm for dense text retrieval. It addresses the discrepancy between training and testing data distributions in dense retrieval by dynamically constructing negative samples from an Approximate Nearest Neighbor (ANN) index of the corpus. This approach significantly boosts retrieval performance and efficiency, making it suitable for researchers and practitioners in information retrieval and natural language processing.
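At its core the objective is a contrastive loss in which each query's relevant passage competes against hard negatives drawn from the ANN index. The following is a minimal, illustrative sketch, not the repository's actual code; the embeddings are assumed to come from a dual-encoder, and `ance_style_nll` is a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def ance_style_nll(q_emb: torch.Tensor, pos_emb: torch.Tensor, neg_embs: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood over one positive and k ANN-mined negatives per query.

    q_emb:    (batch, dim) query embeddings
    pos_emb:  (batch, dim) embeddings of the relevant passages
    neg_embs: (batch, k, dim) embeddings of hard negatives from the ANN index
    """
    pos_score = (q_emb * pos_emb).sum(dim=-1, keepdim=True)    # (batch, 1)
    neg_score = torch.einsum("bd,bkd->bk", q_emb, neg_embs)    # (batch, k)
    scores = torch.cat([pos_score, neg_score], dim=1)          # positive sits at index 0
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, labels)
```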
How It Works
ANCE trains dense retrieval models by incorporating negative samples that are more representative of irrelevant documents encountered during testing. It achieves this by building an ANN index of the corpus, which is asynchronously updated alongside the model's learning process. This dynamic index allows ANCE to select more realistic negative instances, thereby resolving the training-testing data mismatch and leading to state-of-the-art retrieval accuracy with substantial speedups.
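A rough sketch of that refresh step is shown below, assuming FAISS for the ANN index; the `encode_passages` and `encode_queries` helpers stand in for inference with the latest checkpoint and are not the repository's API. The corpus is re-encoded, the index rebuilt, and the top-ranked non-relevant passages become the next round of negatives (in ANCE this happens asynchronously with training rather than inside the training loop):

```python
import faiss  # ANN library; an assumption for this sketch

def refresh_negatives(encode_passages, encode_queries, passages, queries, positives, k=10):
    """Rebuild an ANN index with the current encoder and mine hard negatives.

    encode_passages / encode_queries: hypothetical functions returning float32
    numpy arrays of shape (n, dim) using the latest model checkpoint.
    positives[i] is the set of passage ids relevant to queries[i].
    """
    p_emb = encode_passages(passages).astype("float32")
    index = faiss.IndexFlatIP(p_emb.shape[1])    # exact inner-product search
    index.add(p_emb)

    q_emb = encode_queries(queries).astype("float32")
    _, nbrs = index.search(q_emb, k + 1)         # top passages under the current model

    # Top-ranked non-relevant passages become the new hard negatives.
    return [[pid for pid in row if pid not in positives[i]][:k]
            for i, row in enumerate(nbrs)]
```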
Quick Start & Requirements
Install: git clone https://github.com/microsoft/ANCE && cd ANCE && python setup.py install
Download data: bash commands/data_download.sh
Model variants (rdot_nll, rdot_nll_multi_chunk) and sequence lengths (512, 2048) are configurable.
Highlighted Details
Maintenance & Community
The project is from Microsoft Research. The README does not detail specific community channels or the current maintenance status.
Licensing & Compatibility
The repository does not explicitly state a license. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The README notes that exactly reproducing reported results may be difficult due to synchronization and environment variations. The training process involves complex, parallelized steps for ANN data generation and model training.
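A highly simplified, hypothetical illustration of that parallel layout follows; the directory names, file naming, and the `generate_ann_data` placeholder are assumptions for this sketch, not the repository's actual drivers. One job watches for new checkpoints and regenerates ANN negatives, while the training job keeps consuming the most recent negatives file:

```python
# Hypothetical orchestration sketch; not the repo's actual scripts or layout.
import time
from pathlib import Path

CKPT_DIR, ANN_DIR = Path("checkpoints"), Path("ann_data")

def generate_ann_data(checkpoint: Path) -> str:
    # Placeholder: re-encode the corpus with this checkpoint, rebuild the ANN
    # index, and mine fresh hard negatives (see the sketch under "How It Works").
    return f"negatives mined with {checkpoint.name}\n"

def ann_data_job(poll_seconds: int = 60) -> None:
    """Runs alongside training: regenerate negatives whenever a newer checkpoint appears."""
    seen = None
    while True:
        ckpts = list(CKPT_DIR.glob("checkpoint-*"))
        latest = max(ckpts, key=lambda p: p.stat().st_mtime, default=None)
        if latest is not None and latest != seen:
            ANN_DIR.mkdir(exist_ok=True)
            (ANN_DIR / f"ann_data_{latest.name}.tsv").write_text(generate_ann_data(latest))
            seen = latest
        time.sleep(poll_seconds)
```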