RetroMAE by staoxiao

Code for retrieval-oriented language model pre-training via masked auto-encoders

Created 2 years ago
267 stars

Top 95.9% on SourcePulse

View on GitHub
Project Summary

RetroMAE provides a codebase for pre-training and fine-tuning retrieval-oriented language models using a Masked Auto-Encoder approach. It targets researchers and practitioners in information retrieval and natural language processing, offering state-of-the-art performance on benchmarks like MS MARCO and BEIR.

How It Works

RetroMAE employs a Masked Auto-Encoder (MAE) pre-training strategy: a full-size encoder embeds a lightly masked input into a sentence embedding, and a lightweight decoder reconstructs a more aggressively masked copy of the input from that embedding, forcing the embedding to capture the sentence's semantics. This approach, particularly in its v2 iteration (Duplex MAE), is designed to enhance the transferability and zero-shot capabilities of dense retrievers, leading to improved performance on both in-domain and out-of-domain datasets.
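
The training step can be illustrated with a short, self-contained sketch. This is not the repository's code: the base checkpoint, mask ratios, and single-layer decoder below are illustrative assumptions (the actual implementation differs in detail, and v2 adds enhanced duplex decoding).

```python
# Minimal sketch of a RetroMAE-style masked auto-encoding step.
# The base checkpoint, mask ratios, and single-layer decoder are
# illustrative assumptions, not the repository's exact implementation.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def mask_tokens(input_ids, special_mask, ratio, mask_id):
    """Randomly replace a `ratio` fraction of non-special tokens with [MASK]."""
    probs = torch.full(input_ids.shape, ratio)
    probs.masked_fill_(special_mask, 0.0)
    chosen = torch.bernoulli(probs).bool()
    corrupted = input_ids.clone()
    corrupted[chosen] = mask_id
    labels = input_ids.clone()
    labels[~chosen] = -100                     # loss only on masked positions
    return corrupted, labels

batch = tokenizer(["retrieval-oriented pre-training via masked auto-encoders"],
                  return_tensors="pt")
special = torch.tensor(tokenizer.get_special_tokens_mask(
    batch["input_ids"][0].tolist(), already_has_special_tokens=True)).bool().unsqueeze(0)

# Encoder side: light masking, standard MLM loss; the final [CLS] hidden
# state becomes the sentence embedding.
enc_ids, enc_labels = mask_tokens(batch["input_ids"], special, 0.30,
                                  tokenizer.mask_token_id)
enc_out = model(input_ids=enc_ids, attention_mask=batch["attention_mask"],
                labels=enc_labels, output_hidden_states=True)
sentence_emb = enc_out.hidden_states[-1][:, 0]

# Decoder side: aggressive masking reconstructed by a shallow decoder that is
# conditioned on the sentence embedding, so the embedding must carry most of
# the sequence's information.
dec_ids, dec_labels = mask_tokens(batch["input_ids"], special, 0.50,
                                  tokenizer.mask_token_id)
dec_inputs = model.bert.embeddings(input_ids=dec_ids)
dec_inputs = torch.cat([sentence_emb.unsqueeze(1), dec_inputs[:, 1:]], dim=1)
shallow_decoder = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
logits = model.cls(shallow_decoder(dec_inputs))
dec_loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits.view(-1, logits.size(-1)), dec_labels.view(-1))

loss = enc_out.loss + dec_loss                 # joint objective for one step
```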

Quick Start & Requirements

  • Install via pip: pip install . or pip install -e . for development.
  • Requires PyTorch.
  • Pre-trained models are available on the Huggingface Hub (e.g., Shitao/RetroMAE); see the usage sketch after this list.
  • Example workflows for pre-training and fine-tuning are provided.
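
The snippet below is a minimal usage sketch for the pre-trained checkpoint mentioned above, assuming it loads with the standard AutoModel/AutoTokenizer API and that the [CLS] hidden state serves as the text embedding.

```python
# Minimal usage sketch: load the pre-trained checkpoint from the Huggingface
# Hub and embed text with the [CLS] hidden state (assumed representation;
# dot product used as the relevance score).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shitao/RetroMAE")
model = AutoModel.from_pretrained("Shitao/RetroMAE")
model.eval()

queries = ["what is a masked auto-encoder?"]
passages = ["RetroMAE pre-trains retrieval-oriented language models by "
            "reconstructing masked tokens from a sentence embedding."]

with torch.no_grad():
    q = tokenizer(queries, return_tensors="pt", padding=True, truncation=True)
    p = tokenizer(passages, return_tensors="pt", padding=True, truncation=True)
    q_emb = model(**q).last_hidden_state[:, 0]     # [CLS] embeddings
    p_emb = model(**p).last_hidden_state[:, 0]

scores = q_emb @ p_emb.T                           # query-passage relevance
print(scores)
```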

Highlighted Details

  • Achieves SOTA performance on MS MARCO and BEIR benchmarks.
  • Offers improved zero-shot performance on out-of-domain datasets.
  • Supports fine-tuning via distillation from cross-encoders; see the sketch after this list.
  • RetroMAE v2 is available on arXiv.
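
The distillation option mentioned above typically trains the bi-encoder (student) to match relevance scores produced by a cross-encoder (teacher). The sketch below shows one common form of such a loss, a temperature-scaled KL divergence; the function name and loss form are illustrative, not the repository's exact recipe.

```python
# Hypothetical sketch of distillation from a cross-encoder: the bi-encoder's
# (student's) scores over candidate passages are pushed toward the
# cross-encoder's (teacher's) scores via a KL-divergence loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """KL divergence between student and teacher score distributions,
    computed per query over its candidate passages."""
    s = F.log_softmax(student_scores / temperature, dim=-1)
    t = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

# student_scores: dot products between each query embedding and its N
# candidate passage embeddings; teacher_scores: precomputed cross-encoder
# scores for the same query-passage pairs.
student_scores = torch.randn(2, 8)                 # (queries, candidates)
teacher_scores = torch.randn(2, 8)
print(distillation_loss(student_scores, teacher_scores))
```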

Maintenance & Community

  • The RetroMAE paper was published at EMNLP 2022.
  • Citation details are provided.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • The project is primarily focused on PyTorch and may require adaptation for other frameworks.
  • Specific hardware requirements for pre-training (e.g., multiple GPUs) are implied by the torchrun commands.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
