DeepCT by AdeDZY

Term weighting framework for first-stage retrieval using BERT

Created 6 years ago

324 stars

Top 84.1% on SourcePulse

Project Summary

DeepCT and HDCT are frameworks for generating context-aware term weights for documents and queries, with the goal of improving first-stage retrieval. Both use BERT's contextualized representations to estimate how important each term is in its context, rather than relying on raw term frequency alone. The project is aimed at information retrieval researchers and practitioners who want richer ranking signals than plain term counts.

How It Works

The core idea is to map BERT's contextual text representations to term weights. DeepCT processes sentences or passages and produces weights that can be stored in a standard inverted index. HDCT extends the approach to longer documents and supports weakly supervised training, which makes it more scalable. The goal is to capture a term's semantic importance in context, beyond simple word counts.
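The mapping from contextual representations to term weights can be illustrated with a minimal sketch. It assumes a single linear layer applied to each token's embedding, uses tiny hand-written vectors in place of real BERT outputs, and keeps the maximum weight when a term occurs more than once. The function names, the clipping of negative predictions to zero, and the max aggregation are illustrative assumptions, not the repository's code.

```python
def term_weights(token_embeddings, w, b):
    """Score each token with a linear layer: weight = w . embedding + b."""
    return [sum(wi * xi for wi, xi in zip(w, emb)) + b for emb in token_embeddings]


def aggregate_by_term(tokens, weights):
    """Collapse per-token scores into one weight per distinct term.

    Assumption: negative predictions are clipped to zero and repeated
    occurrences of a term keep their maximum score.
    """
    best = {}
    for tok, wt in zip(tokens, weights):
        clipped = max(0.0, wt)
        best[tok] = max(best.get(tok, 0.0), clipped)
    return best


# Toy 3-dimensional "contextual embeddings" (stand-ins for BERT outputs).
tokens = ["term", "weighting", "with", "bert", "term"]
embeddings = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.1, 0.9],
    [0.8, 0.2, 0.0],
    [0.4, 0.1, 0.2],
]
w = [1.0, 0.5, -1.0]  # hypothetical learned parameters
b = 0.0

per_token = term_weights(embeddings, w, b)
weights_by_term = aggregate_by_term(tokens, per_token)
print(weights_by_term)
```

Note that the two occurrences of "term" receive different scores because their embeddings differ, which is the sense in which the weights are context-aware.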

Quick Start & Requirements

Install/Run: Requires Python 3 and TensorFlow 1.15.0. Training and inference are run through the provided run_deepct.py script.
Prerequisites: Uncased BERT base model (bert_base_dir), training data (TRAIN_DATA_FILE), and optionally pre-trained checkpoints.
Data: The repository links to pre-computed weights and processed data for MS MARCO Passage Ranking, simplifying reproduction.
Links: arXiv paper, WebConf2020 paper, MS-MARCO rankings

Highlighted Details

Generates floating-point term weights ($y_{t,d}$) that can be scaled to integer TF-like values (e.g., round(y * 100) or round(sqrt(y) * 100)).
Provides BM25 parameters tuned for use with DeepCT-indexed data.
Offers pre-computed weights and processed data for MS MARCO Passage Ranking to facilitate immediate use.
Supports integration with indexing tools like Anserini, Indri, and Lucene.
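The two scaling formulas in the first item above can be sketched as follows. The helper names `to_tf` and `pseudo_document` are hypothetical, and writing each term out repeatedly so that an unmodified indexer counts it as term frequency is an assumption about one possible way to feed the weights to an indexing tool, not a description of the repository's scripts.

```python
import math


def to_tf(weight, use_sqrt=False, scale=100):
    """Convert a floating-point term weight to an integer TF-like value.

    Implements round(y * 100) or, with use_sqrt=True, round(sqrt(y) * 100).
    Negative weights are treated as zero (an assumption).
    """
    y = max(0.0, weight)
    if use_sqrt:
        y = math.sqrt(y)
    return int(round(y * scale))


def pseudo_document(term_weights, use_sqrt=False):
    """Repeat each term TF times so a standard indexer sees the weight as a count."""
    parts = []
    for term, weight in term_weights.items():
        parts.extend([term] * to_tf(weight, use_sqrt))
    return " ".join(parts)


print(to_tf(0.25), to_tf(0.25, use_sqrt=True))      # linear vs. square-root scaling
print(to_tf(0.0004), to_tf(0.0004, use_sqrt=True))  # sqrt keeps very small weights alive
print(pseudo_document({"bert": 0.03, "the": 0.0}))
```

The square-root variant compresses the range of weights, so terms with very small predicted weights are less likely to round down to zero and vanish from the index.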

Maintenance & Community

The project originates from research at CMU. Specific community channels or active maintenance status are not detailed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code is provided for research purposes, and commercial use would require clarification.

Limitations & Caveats

The code depends on TensorFlow 1.15.0, an older release that may be difficult to install in current environments. Training also requires substantial data preparation and computational resources.
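A quick pre-flight check can surface the TensorFlow constraint before any training is attempted. The sketch below is not part of the repository. It assumes that official TensorFlow 1.15 wheels are available only up to Python 3.7, and the function name `environment_problems` is hypothetical.

```python
REQUIRED_TF = "1.15.0"
MAX_PYTHON = (3, 7)  # assumption: newest Python with official TF 1.15 wheels


def environment_problems(python_version, tf_version):
    """Return a list of human-readable problems; an empty list means OK.

    python_version: tuple such as sys.version_info or (3, 7, 9)
    tf_version: version string such as tf.__version__, or None if not installed
    """
    problems = []
    major_minor = tuple(python_version[:2])
    if major_minor[0] != 3:
        problems.append("Python 3 is required")
    elif major_minor > MAX_PYTHON:
        problems.append(
            "Python %d.%d is newer than %d.%d; TensorFlow %s may not install"
            % (major_minor + MAX_PYTHON + (REQUIRED_TF,))
        )
    if tf_version != REQUIRED_TF:
        problems.append(
            "TensorFlow %s expected, found %s" % (REQUIRED_TF, tf_version)
        )
    return problems


# Typical use inside a real environment:
#   import sys, tensorflow as tf
#   print(environment_problems(sys.version_info, tf.__version__))
print(environment_problems((3, 11, 0), "2.15.0"))
```

Running the code in an isolated virtual environment or container pinned to an older Python is one way to satisfy both constraints without touching the system interpreter.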

Explore Similar Projects

PERT by ymcui

Condenser by luyug

awesome-semantic-search by Agrover112

dpr-scale by facebookresearch

ANCE by microsoft

pyterrier by terrier-org

atlas by facebookresearch

BERT-for-Sequence-Labeling-and-Text-Classification by yuanxiaosc

KeyBERT by MaartenGr

ColBERT by stanford-futuredata

pyserini by castorini

sentence-transformers by huggingface