Term weighting framework for first-stage retrieval using BERT
Top 85.7% on sourcepulse
DeepCT and HDCT offer a framework for generating context-aware term weights for documents and queries, improving first-stage retrieval performance. The approach leverages BERT's contextualized embeddings to learn term importance, addressing the limitations of traditional term frequency methods. This is beneficial for researchers and practitioners in information retrieval seeking more sophisticated ranking signals.
How It Works
The core idea is to map BERT's contextual text representations to term weights. DeepCT processes sentences or passages, producing weights that can be integrated into standard inverted indexes. HDCT extends this to handle longer documents and supports weakly-supervised training, making it more scalable. This method aims to capture semantic importance beyond simple word counts.
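The mapping from contextual embeddings to term weights can be sketched with a toy NumPy example. Everything here is illustrative: random vectors stand in for BERT's token embeddings, the regression head is untrained, and the sigmoid squashing is just one way to keep outputs in [0, 1] — only the round(y * 100) / round(sqrt(y) * 100) quantization comes from the README.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in DeepCT these would be BERT's contextualized token
# embeddings for a passage; here they are random vectors for illustration.
tokens = ["deep", "learning", "for", "retrieval"]
token_embeddings = rng.normal(size=(len(tokens), 768))

# Hypothetical per-token regression head (untrained); scaled so the
# pre-activation stays in a numerically stable range for the sigmoid.
w = rng.normal(size=768) / np.sqrt(768)
b = 0.0

def predict_term_weights(embs, w, b):
    """Map each token embedding to a scalar importance in [0, 1]."""
    raw = embs @ w + b
    return 1.0 / (1.0 + np.exp(-raw))  # squash to [0, 1]

def quantize(y, scale=100, sqrt_smooth=False):
    """Turn a real-valued weight into an integer 'term frequency' that a
    standard inverted index can store: round(y * 100), or the smoothed
    round(sqrt(y) * 100) variant."""
    v = np.sqrt(y) if sqrt_smooth else y
    return int(round(float(v) * scale))

weights = predict_term_weights(token_embeddings, w, b)
index_entries = {t: quantize(y) for t, y in zip(tokens, weights)}
print(index_entries)
```

The key design point is the quantization step: because the learned weights are converted into ordinary integer frequencies, existing inverted-index infrastructure (e.g., BM25 scoring) can consume them without modification.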
Quick Start & Requirements
Training and inference scripts (run_deepct.py) are provided. Users must configure the BERT model directory (bert_base_dir), the training data (TRAIN_DATA_FILE), and optionally pre-trained checkpoints.
Highlighted Details
Predicted term weights are real values in [0, 1]; for indexing, they are quantized to integer term frequencies (e.g., round(y * 100) or round(sqrt(y) * 100)).
Maintenance & Community
The project originates from research at CMU. Specific community channels or active maintenance status are not detailed in the README.
Licensing & Compatibility
The repository does not explicitly state a license. The code is provided for research purposes, and commercial use would require clarification from the authors.
Limitations & Caveats
The code is dependent on TensorFlow 1.15.0, which is an older version and may present compatibility challenges with current environments. The training process requires significant data preparation and computational resources.
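Given the TensorFlow 1.15.0 dependency, an isolated environment is advisable. The sketch below is an assumption-laden setup, not from the README: only the TensorFlow version is stated there, and the virtualenv name is invented; note that TensorFlow 1.x supports Python 3.7 at most.

```shell
# Hypothetical environment setup; only the TensorFlow pin comes from the README.
# TF 1.15 requires Python <= 3.7, so the interpreter must be chosen accordingly.
python3.7 -m venv deepct-env
. deepct-env/bin/activate
pip install tensorflow==1.15.0
```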
Last updated: 4 years ago. Status: Inactive.