ccf_2020_qa_match  by xv44586

Code for the CCF 2020 question–answer matching competition, built on BERT

created 4 years ago
266 stars

Top 96.9% on sourcepulse

View on GitHub
Project Summary

This repository contains code and techniques for achieving top performance in the CCF 2020 QA Matching competition. It targets NLP practitioners and researchers looking to improve question-answering systems, offering advanced methods beyond standard fine-tuning. The primary benefit is a proven path to state-of-the-art results on a challenging QA matching task.

How It Works

The project explores various advanced fine-tuning and training strategies for BERT-based models. Key techniques include post-training with masked language modeling (MLM) enhancements (whole word masking, dynamic masking, new word mining), incorporating external knowledge via embeddings, contrastive learning (self-supervised and supervised), self-distillation, and adversarial training. These methods aim to improve model robustness, generalization, and feature extraction for the specific task of matching questions to relevant answers.
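As an illustration of the dynamic whole-word masking mentioned above, here is a minimal, self-contained sketch; it is not the repository's actual code, and the function name and token format are assumptions:

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Dynamic whole-word masking: WordPiece continuation pieces
    (prefixed with '##') are always masked together with their head."""
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# Because masking is re-sampled on every call, each training epoch
# sees a different corruption of the same sentence (dynamic masking).
tokens = ["play", "##ing", "foot", "##ball", "is", "fun"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

In a real post-training pipeline this would run inside the data collator, so the masks change each epoch instead of being fixed at preprocessing time.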

Quick Start & Requirements

  • Install dependencies via pip.
  • Requires Python and a BERT-based pre-trained model (e.g., Nezha-base-wwm).
  • Specific scripts for different techniques are provided (e.g., pair-post-training-wwm-sop.py).
  • Competition dataset: https://www.datafountain.cn/competitions/474/datasets

Highlighted Details

  • Achieved Top 1 on both A/B leaderboards of the CCF 2020 QA Matching competition.
  • Post-training with MLM strategies (whole word mask + dynamic mask) showed significant improvements.
  • Explored both embedding-level and output-level knowledge fusion, though initial results were inconclusive.
  • Implemented both self-supervised and supervised contrastive learning approaches.
  • Utilized FGM for adversarial training on embeddings.
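A minimal sketch of the FGM idea from the last bullet, using a hand-differentiated toy quadratic loss in place of a real BERT embedding layer; the function name and the toy model are illustrative assumptions, not the repository's implementation:

```python
import numpy as np

def fgm_perturb(grad, epsilon=1.0):
    """FGM (Fast Gradient Method): perturbation along the normalized
    gradient of the loss with respect to the embedding."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    return epsilon * grad / norm

# Toy differentiable "model": loss(e) = 0.5 * ||W @ e - y||^2,
# where e stands in for a token embedding.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
y = rng.normal(size=3)
e = rng.normal(size=4)

def loss(emb):
    r = W @ emb - y
    return 0.5 * float(r @ r)

grad = W.T @ (W @ e - y)            # d(loss)/d(e), computed by hand
r_adv = fgm_perturb(grad, epsilon=0.1)

clean, adv = loss(e), loss(e + r_adv)
total = clean + adv                 # train on clean + adversarial loss
```

In the real setup, the perturbation is added to BERT's embedding matrix for one extra forward/backward pass and removed before the optimizer step, so only the gradients (not the weights) keep the adversarial signal.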

Maintenance & Community

  • The repository comes from the competition's winning entry, so its focus is reproducing peak leaderboard performance.
  • The author mentions code is being organized and will be released.
  • A summary blog post is linked for further details.

Licensing & Compatibility

  • No explicit license is mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README indicates that complex classification layers (CNN/RNN/DGCNN) added after post-training did not yield further improvements. Integrating external knowledge via word2vec embeddings also did not improve performance in experiments. The data augmentation strategy using pseudo-labeling requires careful filtering to avoid introducing errors.
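As an illustration of the kind of careful filtering the pseudo-labeling caveat calls for, here is a minimal confidence-threshold sketch; the function name and threshold value are assumptions, not the repository's code:

```python
def filter_pseudo_labels(examples, probs, threshold=0.95):
    """Keep only unlabeled examples whose top predicted class
    probability clears a confidence threshold, to limit label noise."""
    kept = []
    for ex, p in zip(examples, probs):
        conf = max(p)
        if conf >= threshold:
            kept.append((ex, p.index(conf)))  # (example, pseudo-label)
    return kept

preds = [[0.98, 0.02], [0.55, 0.45], [0.10, 0.90]]
data = ["q1 || a1", "q2 || a2", "q3 || a3"]
print(filter_pseudo_labels(data, preds, threshold=0.9))
# → [('q1 || a1', 0), ('q3 || a3', 1)]  (the uncertain q2 is dropped)
```

Raising the threshold trades pseudo-label volume for purity; too low a threshold feeds model errors back into training.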

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Didier Lopes (founder of OpenBB), and 11 more.

sentence-transformers by UKPLab

  • Top 0.2% on sourcepulse · 17k stars
  • Framework for text embeddings, retrieval, and reranking
  • Created 6 years ago · updated 3 days ago