RetroMAE by staoxiao

Code for retrieval-oriented language model pre-training via masked auto-encoders

Created 2 years ago
267 stars

Top 95.9% on SourcePulse

View on GitHub
Project Summary

RetroMAE provides a codebase for pre-training and fine-tuning retrieval-oriented language models using a Masked Auto-Encoder approach. It targets researchers and practitioners in information retrieval and natural language processing, offering state-of-the-art performance on benchmarks like MS MARCO and BEIR.

How It Works

RetroMAE employs a Masked Auto-Encoder (MAE) pre-training strategy: a full-size encoder embeds a lightly masked input into a sentence embedding, and a lightweight decoder reconstructs a more aggressively masked copy of the input from that embedding, forcing the embedding to capture the sentence's semantics. This approach, particularly in its v2 iteration (Duplex MAE), is designed to enhance the transferability and zero-shot capabilities of dense retrievers, leading to improved performance on both in-domain and out-of-domain datasets.
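
The training step can be illustrated with a short, self-contained sketch. This is not the repository's code: the base checkpoint, mask ratios, and single-layer decoder below are illustrative assumptions (the actual implementation differs in detail, and v2 adds enhanced duplex decoding).

```python
# Minimal sketch of a RetroMAE-style masked auto-encoding step.
# The base checkpoint, mask ratios, and single-layer decoder are
# illustrative assumptions, not the repository's exact implementation.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def mask_tokens(input_ids, special_mask, ratio, mask_id):
    """Randomly replace a `ratio` fraction of non-special tokens with [MASK]."""
    probs = torch.full(input_ids.shape, ratio)
    probs.masked_fill_(special_mask, 0.0)
    chosen = torch.bernoulli(probs).bool()
    corrupted = input_ids.clone()
    corrupted[chosen] = mask_id
    labels = input_ids.clone()
    labels[~chosen] = -100                     # loss only on masked positions
    return corrupted, labels

batch = tokenizer(["retrieval-oriented pre-training via masked auto-encoders"],
                  return_tensors="pt")
special = torch.tensor(tokenizer.get_special_tokens_mask(
    batch["input_ids"][0].tolist(), already_has_special_tokens=True)).bool().unsqueeze(0)

# Encoder side: light masking, standard MLM loss; the final [CLS] hidden
# state becomes the sentence embedding.
enc_ids, enc_labels = mask_tokens(batch["input_ids"], special, 0.30,
                                  tokenizer.mask_token_id)
enc_out = model(input_ids=enc_ids, attention_mask=batch["attention_mask"],
                labels=enc_labels, output_hidden_states=True)
sentence_emb = enc_out.hidden_states[-1][:, 0]

# Decoder side: aggressive masking reconstructed by a shallow decoder that is
# conditioned on the sentence embedding, so the embedding must carry most of
# the sequence's information.
dec_ids, dec_labels = mask_tokens(batch["input_ids"], special, 0.50,
                                  tokenizer.mask_token_id)
dec_inputs = model.bert.embeddings(input_ids=dec_ids)
dec_inputs = torch.cat([sentence_emb.unsqueeze(1), dec_inputs[:, 1:]], dim=1)
shallow_decoder = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
logits = model.cls(shallow_decoder(dec_inputs))
dec_loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits.view(-1, logits.size(-1)), dec_labels.view(-1))

loss = enc_out.loss + dec_loss                 # joint objective for one step
```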

Quick Start & Requirements

  • Install via pip: pip install . or pip install -e . for development.
  • Requires PyTorch.
  • Pre-trained models are available on the Huggingface Hub (e.g., Shitao/RetroMAE); see the usage sketch after this list.
  • Example workflows for pre-training and fine-tuning are provided.
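
The snippet below is a minimal usage sketch for the pre-trained checkpoint mentioned above, assuming it loads with the standard AutoModel/AutoTokenizer API and that the [CLS] hidden state serves as the text embedding.

```python
# Minimal usage sketch: load the pre-trained checkpoint from the Huggingface
# Hub and embed text with the [CLS] hidden state (assumed representation;
# dot product used as the relevance score).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shitao/RetroMAE")
model = AutoModel.from_pretrained("Shitao/RetroMAE")
model.eval()

queries = ["what is a masked auto-encoder?"]
passages = ["RetroMAE pre-trains retrieval-oriented language models by "
            "reconstructing masked tokens from a sentence embedding."]

with torch.no_grad():
    q = tokenizer(queries, return_tensors="pt", padding=True, truncation=True)
    p = tokenizer(passages, return_tensors="pt", padding=True, truncation=True)
    q_emb = model(**q).last_hidden_state[:, 0]     # [CLS] embeddings
    p_emb = model(**p).last_hidden_state[:, 0]

scores = q_emb @ p_emb.T                           # query-passage relevance
print(scores)
```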

Highlighted Details

  • Achieves SOTA performance on MS MARCO and BEIR benchmarks.
  • Offers improved zero-shot performance on out-of-domain datasets.
  • Supports fine-tuning via distillation from cross-encoders; see the sketch after this list.
  • RetroMAE v2 is available on arXiv.
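
The distillation option mentioned above typically trains the bi-encoder (student) to match relevance scores produced by a cross-encoder (teacher). The sketch below shows one common form of such a loss, a temperature-scaled KL divergence; the function name and loss form are illustrative, not the repository's exact recipe.

```python
# Hypothetical sketch of distillation from a cross-encoder: the bi-encoder's
# (student's) scores over candidate passages are pushed toward the
# cross-encoder's (teacher's) scores via a KL-divergence loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """KL divergence between student and teacher score distributions,
    computed per query over its candidate passages."""
    s = F.log_softmax(student_scores / temperature, dim=-1)
    t = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

# student_scores: dot products between each query embedding and its N
# candidate passage embeddings; teacher_scores: precomputed cross-encoder
# scores for the same query-passage pairs.
student_scores = torch.randn(2, 8)                 # (queries, candidates)
teacher_scores = torch.randn(2, 8)
print(distillation_loss(student_scores, teacher_scores))
```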

Maintenance & Community

  • The RetroMAE paper was published at EMNLP 2022.
  • Citation details are provided.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • The project is primarily focused on PyTorch and may require adaptation for other frameworks.
  • Specific hardware requirements for pre-training (e.g., multiple GPUs) are implied by the torchrun commands.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
