ANCE by microsoft

Embedding training algorithm for text retrieval research

created 5 years ago
374 stars

Top 76.9% on sourcepulse

Project Summary

This repository provides the implementation of Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE), a training algorithm for dense text retrieval. It targets the mismatch between the negatives a dense retriever sees during training and the irrelevant documents it must separate at test time, by constructing negative samples dynamically from an Approximate Nearest Neighbor (ANN) index of the corpus. The project reports gains in both retrieval accuracy and efficiency, and is aimed at researchers and practitioners in information retrieval and natural language processing.

How It Works

ANCE trains dense retrieval models on negative samples that resemble the irrelevant documents the model will face at test time. It builds an ANN index of the corpus and updates that index asynchronously, in parallel with model training. Negatives are then drawn from this index, so they track what the model currently retrieves instead of staying fixed for the whole run. This narrows the training-testing mismatch; the project reports state-of-the-art retrieval accuracy and a nearly 100x speed-up over BERT-reranking methods.
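
As a rough illustration of the negative-selection step described above, the following Python sketch picks training negatives from the documents that the current embeddings place closest to a query. It is not the repository's code: the function names (`top_k`, `sample_hard_negatives`) and their parameters are hypothetical, and exact brute-force search in plain Python stands in for a real ANN index such as FAISS.

```python
import random


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k highest-scoring documents.

    Assumption: exact search stands in for an ANN index lookup.
    """
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def sample_hard_negatives(query_vec, doc_vecs, positive_ids, k=4, n_neg=2, seed=0):
    """Sample negatives from the top-k retrieved documents that are not relevant.

    Hypothetical helper: retrieves a few extra candidates so that k
    non-relevant documents remain after the positives are filtered out.
    """
    rng = random.Random(seed)
    retrieved = top_k(query_vec, doc_vecs, k + len(positive_ids))
    candidates = [i for i in retrieved if i not in positive_ids][:k]
    return rng.sample(candidates, min(n_neg, len(candidates)))
```

For example, with the toy vectors `docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]]` and document 0 marked relevant, `sample_hard_negatives([1.0, 0.0], docs, {0}, k=2, n_neg=2)` returns documents 1 and 2 in some order: the near misses, not the clearly unrelated documents 3 and 4.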

Quick Start & Requirements

  • Install: git clone https://github.com/microsoft/ANCE && cd ANCE && python setup.py install
  • Data Download: bash commands/data_download.sh
  • Prerequisites: Python, PyTorch, Transformers, FAISS, CUDA (implied for performance). Specific model types (rdot_nll, rdot_nll_multi_chunk) and sequence lengths (512, 2048) are configurable.
  • Setup: Requires downloading and preprocessing data, followed by a two-stage training process (warmup and main training) with parallel ANN data generation.
  • Links: Official GitHub Repo (https://github.com/microsoft/ANCE)
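
The parallel ANN data generation mentioned in the setup step means that the negatives used at any training step come from an index built from a slightly older checkpoint. The Python sketch below simulates that lag in a single process. It is only a timing model under stated assumptions, not the repository's scripts: `run_async_refresh`, `refresh_every`, and `build_latency` are hypothetical names, and step 0 is assumed to be the warmup checkpoint.

```python
def run_async_refresh(total_steps, refresh_every, build_latency):
    """Simulate training with an index that is rebuilt in the background.

    Returns a list whose entry at position `step` is the checkpoint step
    from which the index in use at that training step was built.
    """
    index_from = 0   # assumption: the initial index comes from the warmup checkpoint
    pending = None   # (step at which the rebuild finishes, checkpoint step it used)
    log = []
    for step in range(total_steps):
        # Swap in a finished rebuild, if one is ready.
        if pending and step >= pending[0]:
            index_from = pending[1]
            pending = None
        # Start a new rebuild from the current checkpoint when none is in flight.
        if pending is None and step > 0 and step % refresh_every == 0:
            pending = (step + build_latency, step)
        log.append(index_from)
    return log
```

With `run_async_refresh(10, 3, 2)`, training never pauses for a rebuild; the index in use simply trails the model by a few steps, which is the trade-off that asynchronous refreshing accepts.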

Highlighted Details

  • Achieves state-of-the-art retrieval on the TREC DL 2019 and OpenQA benchmarks.
  • Offers nearly 100x speed-up compared to BERT-reranking methods.
  • Demonstrates the advantage of asynchronous ANN refreshing for learning convergence.
  • Provides code for SEED-Encoder fine-tuning.

Maintenance & Community

The project is from Microsoft Research. Specific community channels or active maintenance status are not detailed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README notes that reproduced results may differ from the reported numbers because of synchronization and environment variations. Training involves several parallel steps, with ANN data generation running alongside model training, which makes the pipeline more complex to set up.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
