DeepCT by AdeDZY

Term weighting framework for first-stage retrieval using BERT

created 5 years ago
321 stars

Project Summary

DeepCT and HDCT are frameworks for generating context-aware term weights for documents and queries, with the goal of improving first-stage retrieval. They use BERT's contextualized embeddings to learn how important each term is in its context, which addresses a limitation of plain term-frequency counts. The project is aimed at information retrieval researchers and practitioners who want stronger ranking signals than raw term frequency.

How It Works

The core idea is to map BERT's contextual text representations to term weights. DeepCT processes sentences or passages and produces weights that can be stored in a standard inverted index. HDCT extends this to longer documents and supports weakly supervised training. Both aim to capture a term's semantic importance rather than just how often it occurs.
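
As a rough illustration of the mapping described above, the sketch below applies a per-token linear layer to contextual embeddings and keeps the largest weight for each distinct term. The embeddings are hand-written toy vectors standing in for BERT output, and the function names, the linear-layer form, and the max aggregation are assumptions made for illustration, not code from the repository.

```python
def token_weights(token_embeddings, w, b):
    """Score each token with a linear layer: y_t = w . h_t + b.

    token_embeddings: list of vectors, one per token (toy stand-ins for BERT output).
    w, b: hypothetical learned regression parameters.
    """
    return [sum(wi * hi for wi, hi in zip(w, h)) + b for h in token_embeddings]


def aggregate_by_term(tokens, weights):
    """Collapse repeated tokens into one weight per term.

    Keeps the maximum weight seen for each term. Because the running
    maximum starts at 0.0, negative predictions are clipped to zero.
    """
    per_term = {}
    for tok, y in zip(tokens, weights):
        per_term[tok] = max(per_term.get(tok, 0.0), y)
    return per_term


# Toy passage: "bert" appears twice with different contextual vectors.
tokens = ["bert", "term", "bert"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w, b = [0.6, 0.2], 0.1  # hypothetical parameters

weights = token_weights(embeddings, w, b)
per_term = aggregate_by_term(tokens, weights)
print(per_term)
```

The same surface term receives different weights in different contexts, which is the point of using contextual embeddings instead of raw counts.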

Quick Start & Requirements

  • Install/Run: Requires Python 3 and TensorFlow 1.15.0. Training and inference scripts (run_deepct.py) are provided.
  • Prerequisites: Uncased BERT base model (bert_base_dir), training data (TRAIN_DATA_FILE), and optionally pre-trained checkpoints.
  • Data: The repository links to pre-computed weights and processed data for MS MARCO Passage Ranking, simplifying reproduction.
  • Links: arXiv paper, WebConf2020 paper, MS MARCO rankings

Highlighted Details

  • Generates floating-point term weights ($y_{t,d}$) that can be scaled to integer TF-like values (e.g., round(y * 100) or round(sqrt(y) * 100)).
  • Provides fine-tuned BM25 parameters for optimal performance with DeepCT-indexed data.
  • Offers pre-computed weights and processed data for MS MARCO Passage Ranking to facilitate immediate use.
  • Supports integration with indexing tools like Anserini, Indri, and Lucene.
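
The scaling mentioned in the first bullet can be sketched as follows. The helper names and the term-repetition trick (writing each term TF times so that a standard indexer counts it as term frequency) are assumptions made for illustration; the repository's own conversion steps may differ.

```python
import math


def scale_weight(y, scale=100, use_sqrt=False):
    """Convert a float term weight into an integer TF-like value.

    Implements round(y * scale) or round(sqrt(y) * scale).
    Negative weights are clipped to zero before scaling.
    """
    y = max(y, 0.0)
    value = math.sqrt(y) if use_sqrt else y
    return round(value * scale)


def to_pseudo_document(term_weights, scale=100, use_sqrt=False):
    """Repeat each term TF times so a plain indexer sees the scaled weight as TF.

    term_weights: dict mapping term -> float weight (hypothetical input format).
    Terms whose scaled weight is 0 are dropped from the output.
    """
    parts = []
    for term, y in term_weights.items():
        parts.extend([term] * scale_weight(y, scale, use_sqrt))
    return " ".join(parts)


example = {"bert": 0.02, "the": 0.0}
print(scale_weight(0.25))                 # linear scaling
print(scale_weight(0.25, use_sqrt=True))  # square-root scaling
print(to_pseudo_document(example))
```

Square-root scaling compresses the gap between high and low weights, so low-weight terms keep more of their relative influence than under linear scaling.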

Maintenance & Community

The project originates from research at CMU. Specific community channels or active maintenance status are not detailed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code is provided for research purposes, and commercial use would require clarification.

Limitations & Caveats

The code depends on TensorFlow 1.15.0, an older release that may be difficult to install or run in current environments. Training requires significant data preparation and computational resources.

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 90 days