Condenser by luyug

Research paper code for dense retrieval pre-training

created 4 years ago
251 stars

Top 99.8% on sourcepulse

Project Summary

This repository provides the code and pre-trained models for Condenser, a family of Transformer architectures designed for efficient dense retrieval. It targets researchers and practitioners in Natural Language Processing (NLP) and Information Retrieval (IR) looking to improve the performance of dense passage retrieval systems. The primary benefit is enhanced retrieval accuracy through specialized pre-training objectives.

How It Works

Condenser is a Transformer pre-training architecture tailored to dense retrieval. On top of a standard BERT/RoBERTa backbone, it adds a small Transformer head that performs masked language modeling late in the network ("late MLM") over the final-layer [CLS] vector concatenated with token representations taken from an earlier, "skip-from" layer. Because the head only sees early token states plus the final [CLS], the information refined in the backbone's upper layers must flow into [CLS] to support the head's predictions, which conditions the model to produce strong passage representations. Models pre-trained this way fine-tune into better retrievers than standard BERT or RoBERTa checkpoints.
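A minimal sketch of this head structure, assuming a BERT-style Huggingface backbone; the class names, layer count, and skip_from index are illustrative, not the authors' implementation:

```python
# Minimal sketch of the Condenser head idea (illustrative, not the authors' code).
# Assumptions: a BERT-style backbone; `skip_from` selects the early layer whose
# token states feed the head; the head is a couple of freshly initialized layers.
import torch
from torch import nn
from transformers import AutoConfig, AutoModel
from transformers.models.bert.modeling_bert import BertLayer


class CondenserSketch(nn.Module):
    def __init__(self, backbone_name="bert-base-uncased", n_head_layers=2, skip_from=6):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        config = AutoConfig.from_pretrained(backbone_name)
        self.head = nn.ModuleList([BertLayer(config) for _ in range(n_head_layers)])
        self.skip_from = skip_from

    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids, attention_mask=attention_mask, output_hidden_states=True
        )
        early_tokens = out.hidden_states[self.skip_from]  # early-layer token states
        late_cls = out.last_hidden_state[:, :1]           # final-layer [CLS] vector

        # The head runs "late MLM" over [late CLS ; early token states], so the
        # [CLS] vector must condense passage-level information to be useful.
        hidden = torch.cat([late_cls, early_tokens[:, 1:]], dim=1)
        ext_mask = self.backbone.get_extended_attention_mask(
            attention_mask, attention_mask.shape
        )
        for layer in self.head:
            hidden = layer(hidden, attention_mask=ext_mask)[0]
        return hidden  # feed into an MLM prediction head for the late-MLM loss
```

At fine-tuning time the head is discarded and the backbone's [CLS] vector serves as the dense passage representation.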

Quick Start & Requirements

  • Install: Requires PyTorch, Huggingface Transformers, Datasets, and NLTK.
  • Pre-trained Models: Available on Huggingface Hub (e.g., Luyu/condenser, Luyu/co-condenser-wiki, Luyu/co-condenser-marco).
  • Fine-tuning: Load models using transformers.AutoModel.from_pretrained() (see the loading sketch after this list).
  • Pre-training: Requires significant computational resources (multiple GPUs) and large datasets. Pre-processing involves tokenizing text into specific formats.
  • Documentation: Links to papers and related toolkits (DPR, GC-DPR, Tevatron) are provided.
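A minimal loading-and-encoding example for fine-tuning or inference, assuming one of the Hub checkpoints listed above; [CLS] pooling is the conventional choice for Condenser-style models:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Luyu/co-condenser-marco"  # any of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

passages = ["Condenser is a pre-training architecture for dense retrieval."]
inputs = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] vector as the passage embedding.
embeddings = outputs.last_hidden_state[:, 0]
print(embeddings.shape)  # (num_passages, hidden_size)
```

Queries are encoded the same way, and query-passage relevance is typically scored with a dot product between the two embeddings.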

Highlighted Details

  • Offers pre-trained "Headless Condenser" models on Huggingface Hub.
  • Supports fine-tuning for downstream tasks like open QA (NQ/TriviaQA) and supervised IR (MS-MARCO).
  • Provides toolkits (GC-DPR, Tevatron) for memory-efficient training and supervised IR.
  • Pre-training scripts detail distributed training setup with FP16 support.

Maintenance & Community

  • Developed by Luyu Gao and colleagues, with contributions cited in associated papers.
  • Related toolkits (GC-DPR, Tevatron) are mentioned, suggesting an active ecosystem.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README warns that pairing the released backbone weights with a randomly initialized head corrupts the pre-trained weights during further pre-training, so the provided head weights should be used instead. Effective contrastive pre-training also requires a large effective batch size, which may necessitate gradient caching techniques when GPU memory is limited.
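Gradient caching sidesteps the memory limit by separating representation computation from the contrastive backward pass. A conceptual sketch of the idea (not the GC-DPR or Tevatron implementation; `encoder` and `contrastive_loss` are stand-ins):

```python
import torch

def gradient_cached_step(encoder, chunks, contrastive_loss, optimizer):
    """One optimizer step over a large effective batch split into small chunks.

    Assumptions: `encoder(**chunk)` returns a (chunk_size, dim) tensor of
    representations, and `contrastive_loss(reps)` returns a scalar loss over
    the full batch of representations.
    """
    # Pass 1: encode every chunk without autograd graphs to get full-batch reps.
    with torch.no_grad():
        reps = torch.cat([encoder(**c) for c in chunks])
    reps = reps.requires_grad_()

    # Compute the contrastive loss over the whole batch and cache d(loss)/d(reps).
    loss = contrastive_loss(reps)
    loss.backward()
    cached_grads = reps.grad.split([len(c["input_ids"]) for c in chunks])

    # Pass 2: re-encode each chunk with autograd and push the cached gradients
    # through the encoder, accumulating parameter gradients chunk by chunk.
    for chunk, grad in zip(chunks, cached_grads):
        chunk_reps = encoder(**chunk)
        chunk_reps.backward(grad)

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

This reproduces full-batch contrastive gradients while only ever holding one chunk's activation graph in memory.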

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days
