quest_qa_labeling by oleg-yaroshevskiy

Advanced Q&A understanding for complex content

Created 5 years ago
250 stars

Top 100.0% on SourcePulse

Summary

This project offers a reproducible, 1st-place solution to the Google QUEST Q&A Labeling competition, which asked participants to predict subjective quality ratings for question-answer pairs. It targets NLP researchers and practitioners seeking a high-performance QA methodology, and its primary benefit is a proven, competition-winning approach to this task.

How It Works

The approach first finetunes language models on StackExchange data, then uses them to generate pseudo-labels. An ensemble of BERT-base-cased, RoBERTa-base, and BART-large models is trained on both the competition data and the pseudo-labeled data. Each model type is trained with 5-fold cross-validation; final predictions are obtained by averaging checkpoints within each model type and blending the outputs of the different model families. This combination of pseudo-labeling and ensembling is central to achieving top performance.
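
As a rough illustration of the fold-averaging and blending step, here is a minimal sketch using dummy per-fold predictions; the array names, shapes, and unweighted mean are illustrative assumptions, not the repo's actual inference code:

    import numpy as np

    rng = np.random.default_rng(0)
    N_SAMPLES, N_TARGETS, N_FOLDS = 100, 30, 5  # 30 = the competition's target columns

    def average_folds(fold_preds):
        # Average predictions from the 5 cross-validation checkpoints of one model type.
        return np.mean(np.stack(fold_preds), axis=0)

    # Dummy stand-ins for the per-fold predictions of each model family.
    bert = average_folds([rng.random((N_SAMPLES, N_TARGETS)) for _ in range(N_FOLDS)])
    roberta = average_folds([rng.random((N_SAMPLES, N_TARGETS)) for _ in range(N_FOLDS)])
    bart = average_folds([rng.random((N_SAMPLES, N_TARGETS)) for _ in range(N_FOLDS)])

    # Blend the model families; an unweighted mean is the simplest variant.
    final = np.mean([bert, roberta, bart], axis=0)
    print(final.shape)  # (100, 30)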

Quick Start & Requirements

Setup requires a Conda environment (conda create -n qa_quest_env python=3.6.6, conda activate qa_quest_env) and dependency installation via pip install -r requirements_full.txt or requirements_minimal.txt. Custom installations for mag and a modified fairseq are handled by bash/setup.sh.

  • Prerequisites: Python 3.6.6, Conda 4.7.10, CUDA 10.0.130, cuDNN 7.5.0, NVIDIA drivers v. 418.67.
  • Hardware: Significant GPU resources are recommended (multiple NVIDIA 1080 Ti or Quadro P6000), especially for LM training.
  • Data: Requires downloading competition data (bash/download_comp_data.sh) and ~18 GB of model checkpoints for inference (bash/download_all_model_ckpts_for_inference.sh). The README references a Kaggle Notebook for inference reproduction.

Highlighted Details

  • Achieved 1st place in the Google QUEST Q&A Labeling competition, scoring 0.46893 on the public leaderboard.
  • Employs a sophisticated ensemble of BERT, RoBERTa, and BART models.
  • Utilizes pseudo-labeling derived from StackExchange data to augment training (sketched below, after this list).
  • Provides detailed scripts for reproducing LM finetuning, pseudo-label generation, model training, and inference.
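
To make the pseudo-labeling idea concrete, the sketch below uses a scikit-learn teacher as a deliberately simplified stand-in for the repo's finetuned transformers; the toy data, column names, and TF-IDF/Ridge choice are all illustrative assumptions:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    # Toy labeled competition rows and unlabeled StackExchange rows.
    labeled = pd.DataFrame({
        "text": ["how do I sort a list?", "what is a monad?"],
        "target": [0.9, 0.3],  # one of the 30 QUEST targets, for brevity
    })
    unlabeled = pd.DataFrame({"text": ["why does my loop never end?"]})

    # 1. Fit a "teacher" model on the labeled data.
    teacher = make_pipeline(TfidfVectorizer(), Ridge())
    teacher.fit(labeled["text"], labeled["target"])

    # 2. Predict soft targets for the unlabeled rows: the pseudo-labels.
    pseudo = unlabeled.copy()
    pseudo["target"] = teacher.predict(unlabeled["text"])

    # 3. Train downstream models on the union of real and pseudo-labeled rows.
    augmented = pd.concat([labeled, pseudo], ignore_index=True)
    print(augmented)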

Maintenance & Community

Project maintained by oleg-yaroshevskiy; contact yury.kashnitsky@gmail.com for questions. No explicit community channels or roadmap links are provided.

Licensing & Compatibility

No license is explicitly stated in the README.

Limitations & Caveats

Requires specific, older versions of Python (3.6.6), CUDA (10.0.130), and cuDNN (7.5.0), posing potential compatibility challenges. Significant GPU hardware is necessary for training and efficient inference. Setup involves custom library installations and handling large model checkpoints (~18 GB).

Health Check

  • Last Commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Vincent Weisser (Cofounder of Prime Intellect), and 8 more.

galai by paperswithcode
Scientific language model API
Top 0.1% on SourcePulse · 3k stars · Created 3 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Nir Gazit (Cofounder of Traceloop), and 4 more.

llmware by llmware-ai
Framework for enterprise RAG pipelines using small, specialized models
Top 0.1% on SourcePulse · 14k stars · Created 2 years ago · Updated 4 months ago