ANCE by microsoft

Embedding training algorithm for text retrieval research

Created 5 years ago

380 stars

Top 75.1% on SourcePulse

Project Summary

This repository provides the implementation of Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE), a training algorithm for dense text retrieval. ANCE addresses the mismatch between the negatives a dense retriever sees during training and the irrelevant documents it must separate at test time by constructing negatives dynamically from an Approximate Nearest Neighbor (ANN) index of the corpus. The summary is aimed at researchers and practitioners in information retrieval and natural language processing.
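To make the "negative contrastive estimation" part of the name concrete, here is a minimal sketch of a contrastive loss over one relevant document and a set of negatives. It assumes dot-product similarity and a softmax negative log-likelihood; the function name `contrastive_nll` and the toy vectors are illustrative and not taken from the repository, whose exact loss may differ.

```python
import numpy as np

def contrastive_nll(query_vec, pos_vec, neg_vecs):
    """Negative log-likelihood of the positive among [positive + negatives]."""
    # Dot-product similarity between the query and each candidate document.
    scores = np.concatenate(([query_vec @ pos_vec], neg_vecs @ query_vec))
    # Numerically stable log-softmax; index 0 is the positive document.
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[0])

# Toy 2-d embeddings (hypothetical values chosen for illustration).
query = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
easy_negative = np.array([[-1.0, 0.0]])  # far from the query
hard_negative = np.array([[0.9, 0.1]])   # close to the query

loss_easy = contrastive_nll(query, positive, easy_negative)
loss_hard = contrastive_nll(query, positive, hard_negative)
print(f"easy negative loss: {loss_easy:.4f}")
print(f"hard negative loss: {loss_hard:.4f}")
```

In this toy setup the negative that lies close to the query yields the larger loss, which is the intuition behind drawing negatives from the nearest neighbors of the query rather than at random.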

How It Works

ANCE trains dense retrieval models by incorporating negative samples that are more representative of irrelevant documents encountered during testing. It achieves this by building an ANN index of the corpus, which is asynchronously updated alongside the model's learning process. This dynamic index allows ANCE to select more realistic negative instances, thereby resolving the training-testing data mismatch and leading to state-of-the-art retrieval accuracy with substantial speedups.
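The loop described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: a brute-force inner-product search over a frozen copy of the embeddings stands in for the ANN index (the project itself lists FAISS as a prerequisite), the encoder update is simulated by random drift, and every name (`build_index`, `sample_hard_negatives`, `refresh_every`) is hypothetical rather than part of the repository.

```python
import numpy as np

def build_index(doc_embeddings):
    """Stand-in for an ANN index: a frozen snapshot of the document embeddings."""
    return doc_embeddings.copy()

def sample_hard_negatives(index, query_vec, positive_ids, top_k, n_neg, rng):
    """Pick negatives from the top-k retrieved documents, excluding positives."""
    scores = index @ query_vec                  # inner-product similarity
    top_ids = np.argsort(-scores)[:top_k]       # exact search replaces ANN here
    candidates = [int(i) for i in top_ids if int(i) not in positive_ids]
    picked = rng.choice(candidates, size=min(n_neg, len(candidates)), replace=False)
    return [int(i) for i in picked]

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(100, 16))  # current encoder output for the corpus
query = rng.normal(size=16)
positives = {3}                       # ids of documents relevant to the query

refresh_every = 5                     # hypothetical refresh interval, in steps
index = build_index(doc_emb)
refresh_steps = []

for step in range(1, 13):
    # Negatives come from the (possibly stale) index, not from the live encoder.
    negs = sample_hard_negatives(index, query, positives, top_k=20, n_neg=4, rng=rng)
    # A real loop would compute the loss on (query, positive, negs) and update
    # the encoder here; random drift simulates the embeddings changing.
    doc_emb = doc_emb + 0.01 * rng.normal(size=doc_emb.shape)
    if step % refresh_every == 0:
        # Simplification: ANCE refreshes the index asynchronously, in parallel
        # with training, rather than inline as done here.
        index = build_index(doc_emb)
        refresh_steps.append(step)

print("last sampled negatives:", negs)
print("index refreshed at steps:", refresh_steps)
```

Between refreshes the index lags behind the encoder, so the interval trades negative freshness against the cost of re-encoding and re-indexing the corpus.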

Quick Start & Requirements

Install: git clone https://github.com/microsoft/ANCE && cd ANCE && python setup.py install
Data Download: bash commands/data_download.sh
Prerequisites: Python, PyTorch, Transformers, FAISS, CUDA (implied for performance). Specific model types (rdot_nll, rdot_nll_multi_chunk) and sequence lengths (512, 2048) are configurable.
Setup: Requires downloading and preprocessing data, followed by a two-stage training process (warmup and main training) with parallel ANN data generation.
Links: Official GitHub Repo (https://github.com/microsoft/ANCE)

Highlighted Details

Achieved state-of-the-art retrieval accuracy on the TREC DL 2019 and OpenQA benchmarks.
Offers nearly 100x speed-up compared to BERT-reranking methods.
Demonstrates the advantage of asynchronous ANN refreshing for learning convergence.
Provides code for SEED-Encoder fine-tuning.

Maintenance & Community

The project is from Microsoft Research. Specific community channels or active maintenance status are not detailed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README notes that reproduced results may differ from the reported numbers because of synchronization and environment variations. Training involves several parallel steps, with ANN data generation running alongside model training, which makes the pipeline complex to set up.

Explore Similar Projects

Luotuo-Text-Embedding by LC1332

dpr-scale by facebookresearch

denser-retriever by denser-org

DeepCT by AdeDZY

awesome-pretrained-models-for-information-retrieval by ict-bigdatalab

pylate by lightonai

pyterrier by terrier-org

atlas by facebookresearch

beir by beir-cellar

ColBERT by stanford-futuredata

FlagEmbedding by FlagOpen

sentence-transformers by huggingface