ccf_2020_qa_match  by xv44586

Code for the CCF 2020 question–answer matching competition, built on BERT

created 4 years ago
266 stars

Top 96.9% on sourcepulse

View on GitHub
Project Summary

This repository contains code and techniques for achieving top performance in the CCF 2020 QA Matching competition. It targets NLP practitioners and researchers looking to improve question-answering systems, offering advanced methods beyond standard fine-tuning. The primary benefit is a proven path to state-of-the-art results on a challenging QA matching task.

How It Works

The project explores various advanced fine-tuning and training strategies for BERT-based models. Key techniques include post-training with masked language modeling (MLM) enhancements (whole word masking, dynamic masking, new word mining), incorporating external knowledge via embeddings, contrastive learning (self-supervised and supervised), self-distillation, and adversarial training. These methods aim to improve model robustness, generalization, and feature extraction for the specific task of matching questions to relevant answers.
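As an illustration of the dynamic whole-word masking mentioned above, here is a minimal, self-contained sketch; it is not the repository's actual code, and the function name and token format are assumptions:

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Dynamic whole-word masking: WordPiece continuation pieces
    (prefixed with '##') are always masked together with their head."""
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# Because masking is re-sampled on every call, each training epoch
# sees a different corruption of the same sentence (dynamic masking).
tokens = ["play", "##ing", "foot", "##ball", "is", "fun"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

In a real post-training pipeline this would run inside the data collator, so the masks change each epoch instead of being fixed at preprocessing time.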

Quick Start & Requirements

  • Install dependencies via pip.
  • Requires Python and a BERT-based pre-trained model (e.g., Nezha-base-wwm).
  • Specific scripts for different techniques are provided (e.g., pair-post-training-wwm-sop.py).
  • Competition dataset: https://www.datafountain.cn/competitions/474/datasets

Highlighted Details

  • Achieved Top 1 on both A/B leaderboards of the CCF 2020 QA Matching competition.
  • Post-training with MLM strategies (whole word mask + dynamic mask) showed significant improvements.
  • Explored both embedding-level and output-level knowledge fusion, though initial results were inconclusive.
  • Implemented both self-supervised and supervised contrastive learning approaches.
  • Utilized FGM for adversarial training on embeddings.
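A minimal sketch of the FGM idea from the last bullet, using a hand-differentiated toy quadratic loss in place of a real BERT embedding layer; the function name and the toy model are illustrative assumptions, not the repository's implementation:

```python
import numpy as np

def fgm_perturb(grad, epsilon=1.0):
    """FGM (Fast Gradient Method): perturbation along the normalized
    gradient of the loss with respect to the embedding."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    return epsilon * grad / norm

# Toy differentiable "model": loss(e) = 0.5 * ||W @ e - y||^2,
# where e stands in for a token embedding.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
y = rng.normal(size=3)
e = rng.normal(size=4)

def loss(emb):
    r = W @ emb - y
    return 0.5 * float(r @ r)

grad = W.T @ (W @ e - y)            # d(loss)/d(e), computed by hand
r_adv = fgm_perturb(grad, epsilon=0.1)

clean, adv = loss(e), loss(e + r_adv)
total = clean + adv                 # train on clean + adversarial loss
```

In the real setup, the perturbation is added to BERT's embedding matrix for one extra forward/backward pass and removed before the optimizer step, so only the gradients (not the weights) keep the adversarial signal.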

Maintenance & Community

  • The repository comes from the competition's winning entry, so its focus is reproducing peak leaderboard performance.
  • The author mentions code is being organized and will be released.
  • A summary blog post is linked for further details.

Licensing & Compatibility

  • No explicit license is mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README indicates that complex classification layers (CNN/RNN/DGCNN) added after post-training did not yield further improvements. Integrating external knowledge via word2vec embeddings also did not improve performance in experiments. The data augmentation strategy using pseudo-labeling requires careful filtering to avoid introducing errors.
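As an illustration of the kind of careful filtering the pseudo-labeling caveat calls for, here is a minimal confidence-threshold sketch; the function name and threshold value are assumptions, not the repository's code:

```python
def filter_pseudo_labels(examples, probs, threshold=0.95):
    """Keep only unlabeled examples whose top predicted class
    probability clears a confidence threshold, to limit label noise."""
    kept = []
    for ex, p in zip(examples, probs):
        conf = max(p)
        if conf >= threshold:
            kept.append((ex, p.index(conf)))  # (example, pseudo-label)
    return kept

preds = [[0.98, 0.02], [0.55, 0.45], [0.10, 0.90]]
data = ["q1 || a1", "q2 || a2", "q3 || a3"]
print(filter_pseudo_labels(data, preds, threshold=0.9))
# → [('q1 || a1', 0), ('q3 || a3', 1)]  (the uncertain q2 is dropped)
```

Raising the threshold trades pseudo-label volume for purity; too low a threshold feeds model errors back into training.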

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Didier Lopes (founder of OpenBB), and 11 more.

sentence-transformers by UKPLab

  • Top 0.2% on sourcepulse · 17k stars
  • Framework for text embeddings, retrieval, and reranking
  • Created 6 years ago · updated 3 days ago