quest_qa_labeling by oleg-yaroshevskiy

Advanced Q&A understanding for complex content

Created 5 years ago
250 stars

Top 100.0% on SourcePulse

Summary

This project offers a reproducible, 1st-place solution to the Google QUEST Q&A Labeling competition, which asked participants to predict subjective quality ratings for question-answer pairs. It targets NLP researchers and practitioners seeking a high-performance QA methodology, and its primary benefit is a proven, competition-winning approach to this task.

How It Works

The approach first finetunes language models on StackExchange data, then uses them to generate pseudo-labels. An ensemble of BERT-base-cased, RoBERTa-base, and BART-large models is trained on both the competition data and the pseudo-labeled data. Each model type is trained with 5-fold cross-validation; final predictions are obtained by averaging checkpoints within each model type and blending the outputs of the different model families. This combination of pseudo-labeling and ensembling is central to achieving top performance.
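
As a rough illustration of the fold-averaging and blending step, here is a minimal sketch using dummy per-fold predictions; the array names, shapes, and unweighted mean are illustrative assumptions, not the repo's actual inference code:

    import numpy as np

    rng = np.random.default_rng(0)
    N_SAMPLES, N_TARGETS, N_FOLDS = 100, 30, 5  # 30 = the competition's target columns

    def average_folds(fold_preds):
        # Average predictions from the 5 cross-validation checkpoints of one model type.
        return np.mean(np.stack(fold_preds), axis=0)

    # Dummy stand-ins for the per-fold predictions of each model family.
    bert = average_folds([rng.random((N_SAMPLES, N_TARGETS)) for _ in range(N_FOLDS)])
    roberta = average_folds([rng.random((N_SAMPLES, N_TARGETS)) for _ in range(N_FOLDS)])
    bart = average_folds([rng.random((N_SAMPLES, N_TARGETS)) for _ in range(N_FOLDS)])

    # Blend the model families; an unweighted mean is the simplest variant.
    final = np.mean([bert, roberta, bart], axis=0)
    print(final.shape)  # (100, 30)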

Quick Start & Requirements

Setup requires a Conda environment (conda create -n qa_quest_env python=3.6.6, conda activate qa_quest_env) and dependency installation via pip install -r requirements_full.txt or requirements_minimal.txt. Custom installations for mag and a modified fairseq are handled by bash/setup.sh.

  • Prerequisites: Python 3.6.6, Conda 4.7.10, CUDA 10.0.130, cuDNN 7.5.0, NVIDIA drivers v. 418.67.
  • Hardware: Significant GPU resources are recommended (multiple NVIDIA 1080 Ti or Quadro P6000), especially for LM training.
  • Data: Requires downloading competition data (bash/download_comp_data.sh) and ~18 GB of model checkpoints for inference (bash/download_all_model_ckpts_for_inference.sh). The README references a Kaggle Notebook for inference reproduction.

Highlighted Details

  • Achieved 1st place in the Google QUEST Q&A Labeling competition, scoring 0.46893 on the public leaderboard.
  • Employs a sophisticated ensemble of BERT, RoBERTa, and BART models.
  • Utilizes pseudo-labeling derived from StackExchange data to augment training (sketched below, after this list).
  • Provides detailed scripts for reproducing LM finetuning, pseudo-label generation, model training, and inference.
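
To make the pseudo-labeling idea concrete, the sketch below uses a scikit-learn teacher as a deliberately simplified stand-in for the repo's finetuned transformers; the toy data, column names, and TF-IDF/Ridge choice are all illustrative assumptions:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    # Toy labeled competition rows and unlabeled StackExchange rows.
    labeled = pd.DataFrame({
        "text": ["how do I sort a list?", "what is a monad?"],
        "target": [0.9, 0.3],  # one of the 30 QUEST targets, for brevity
    })
    unlabeled = pd.DataFrame({"text": ["why does my loop never end?"]})

    # 1. Fit a "teacher" model on the labeled data.
    teacher = make_pipeline(TfidfVectorizer(), Ridge())
    teacher.fit(labeled["text"], labeled["target"])

    # 2. Predict soft targets for the unlabeled rows: the pseudo-labels.
    pseudo = unlabeled.copy()
    pseudo["target"] = teacher.predict(unlabeled["text"])

    # 3. Train downstream models on the union of real and pseudo-labeled rows.
    augmented = pd.concat([labeled, pseudo], ignore_index=True)
    print(augmented)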

Maintenance & Community

Project maintained by oleg-yaroshevskiy; contact yury.kashnitsky@gmail.com for questions. No explicit community channels or roadmap links are provided.

Licensing & Compatibility

No license is explicitly stated in the README.

Limitations & Caveats

Requires specific, older versions of Python (3.6.6), CUDA (10.0.130), and cuDNN (7.5.0), posing potential compatibility challenges. Significant GPU hardware is necessary for training and efficient inference. Setup involves custom library installations and handling large model checkpoints (~18 GB).

Health Check

  • Last Commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Vincent Weisser (Cofounder of Prime Intellect), and 8 more.

galai by paperswithcode
Scientific language model API
Top 0.1% on SourcePulse · 3k stars · Created 3 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Nir Gazit (Cofounder of Traceloop), and 4 more.

llmware by llmware-ai
Framework for enterprise RAG pipelines using small, specialized models
Top 0.1% on SourcePulse · 14k stars · Created 2 years ago · Updated 4 months ago