Condenser by luyug

Research paper code for dense retrieval pre-training

created 4 years ago
251 stars

Top 99.8% on sourcepulse

Project Summary

This repository provides the code and pre-trained models for Condenser, a family of Transformer architectures designed for efficient dense retrieval. It targets researchers and practitioners in Natural Language Processing (NLP) and Information Retrieval (IR) looking to improve the performance of dense passage retrieval systems. The primary benefit is enhanced retrieval accuracy through specialized pre-training objectives.

How It Works

Condenser is a Transformer pre-training architecture tailored to dense retrieval. On top of a standard BERT/RoBERTa backbone, it adds a small Transformer head that performs masked language modeling late in the network ("late MLM") over the final-layer [CLS] vector concatenated with token representations taken from an earlier, "skip-from" layer. Because the head only sees early token states plus the final [CLS], the information refined in the backbone's upper layers must flow into [CLS] to support the head's predictions, which conditions the model to produce strong passage representations. Models pre-trained this way fine-tune into better retrievers than standard BERT or RoBERTa checkpoints.
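A minimal sketch of this head structure, assuming a BERT-style Huggingface backbone; the class names, layer count, and skip_from index are illustrative, not the authors' implementation:

```python
# Minimal sketch of the Condenser head idea (illustrative, not the authors' code).
# Assumptions: a BERT-style backbone; `skip_from` selects the early layer whose
# token states feed the head; the head is a couple of freshly initialized layers.
import torch
from torch import nn
from transformers import AutoConfig, AutoModel
from transformers.models.bert.modeling_bert import BertLayer


class CondenserSketch(nn.Module):
    def __init__(self, backbone_name="bert-base-uncased", n_head_layers=2, skip_from=6):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        config = AutoConfig.from_pretrained(backbone_name)
        self.head = nn.ModuleList([BertLayer(config) for _ in range(n_head_layers)])
        self.skip_from = skip_from

    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids, attention_mask=attention_mask, output_hidden_states=True
        )
        early_tokens = out.hidden_states[self.skip_from]  # early-layer token states
        late_cls = out.last_hidden_state[:, :1]           # final-layer [CLS] vector

        # The head runs "late MLM" over [late CLS ; early token states], so the
        # [CLS] vector must condense passage-level information to be useful.
        hidden = torch.cat([late_cls, early_tokens[:, 1:]], dim=1)
        ext_mask = self.backbone.get_extended_attention_mask(
            attention_mask, attention_mask.shape
        )
        for layer in self.head:
            hidden = layer(hidden, attention_mask=ext_mask)[0]
        return hidden  # feed into an MLM prediction head for the late-MLM loss
```

At fine-tuning time the head is discarded and the backbone's [CLS] vector serves as the dense passage representation.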

Quick Start & Requirements

  • Install: Requires PyTorch, Huggingface Transformers, Datasets, and NLTK.
  • Pre-trained Models: Available on Huggingface Hub (e.g., Luyu/condenser, Luyu/co-condenser-wiki, Luyu/co-condenser-marco).
  • Fine-tuning: Load models using transformers.AutoModel.from_pretrained() (see the loading sketch after this list).
  • Pre-training: Requires significant computational resources (multiple GPUs) and large datasets. Pre-processing involves tokenizing text into specific formats.
  • Documentation: Links to papers and related toolkits (DPR, GC-DPR, Tevatron) are provided.
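A minimal loading-and-encoding example for fine-tuning or inference, assuming one of the Hub checkpoints listed above; [CLS] pooling is the conventional choice for Condenser-style models:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Luyu/co-condenser-marco"  # any of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

passages = ["Condenser is a pre-training architecture for dense retrieval."]
inputs = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] vector as the passage embedding.
embeddings = outputs.last_hidden_state[:, 0]
print(embeddings.shape)  # (num_passages, hidden_size)
```

Queries are encoded the same way, and query-passage relevance is typically scored with a dot product between the two embeddings.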

Highlighted Details

  • Offers pre-trained "Headless Condenser" models on Huggingface Hub.
  • Supports fine-tuning for downstream tasks like open QA (NQ/TriviaQA) and supervised IR (MS-MARCO).
  • Provides toolkits (GC-DPR, Tevatron) for memory-efficient training and supervised IR.
  • Pre-training scripts detail distributed training setup with FP16 support.

Maintenance & Community

  • Developed by Luyu Gao and colleagues, with contributions cited in associated papers.
  • Related toolkits (GC-DPR, Tevatron) are mentioned, suggesting an active ecosystem.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README warns that pairing the released backbone weights with a randomly initialized head corrupts the pre-trained weights during further pre-training, so the provided head weights should be used instead. Effective contrastive pre-training also requires a large effective batch size, which may necessitate gradient caching techniques when GPU memory is limited.
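Gradient caching sidesteps the memory limit by separating representation computation from the contrastive backward pass. A conceptual sketch of the idea (not the GC-DPR or Tevatron implementation; `encoder` and `contrastive_loss` are stand-ins):

```python
import torch

def gradient_cached_step(encoder, chunks, contrastive_loss, optimizer):
    """One optimizer step over a large effective batch split into small chunks.

    Assumptions: `encoder(**chunk)` returns a (chunk_size, dim) tensor of
    representations, and `contrastive_loss(reps)` returns a scalar loss over
    the full batch of representations.
    """
    # Pass 1: encode every chunk without autograd graphs to get full-batch reps.
    with torch.no_grad():
        reps = torch.cat([encoder(**c) for c in chunks])
    reps = reps.requires_grad_()

    # Compute the contrastive loss over the whole batch and cache d(loss)/d(reps).
    loss = contrastive_loss(reps)
    loss.backward()
    cached_grads = reps.grad.split([len(c["input_ids"]) for c in chunks])

    # Pass 2: re-encode each chunk with autograd and push the cached gradients
    # through the encoder, accumulating parameter gradients chunk by chunk.
    for chunk, grad in zip(chunks, cached_grads):
        chunk_reps = encoder(**chunk)
        chunk_reps.backward(grad)

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

This reproduces full-batch contrastive gradients while only ever holding one chunk's activation graph in memory.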

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days
