Chinese NLP pre-trained language model research paper
MacBERT is a pre-trained language model designed to improve performance on Chinese Natural Language Processing (NLP) tasks by addressing the discrepancy between the pre-training objective and downstream tasks. It is targeted at NLP researchers and practitioners working with Chinese text, and its primary benefit is improved performance across a range of Chinese NLP benchmarks.
How It Works
MacBERT introduces a novel "MLM as correction" (Mac) pre-training task. Instead of masking tokens with the artificial [MASK] symbol, MacBERT replaces them with semantically similar words identified via word2vec-based similarity. This narrows the gap between pre-training and downstream tasks, where the [MASK] token never appears, by framing the objective as correcting plausible word substitutions rather than recovering masked positions. MacBERT also incorporates Whole Word Masking and N-gram masking.
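As a rough illustration of this objective, the sketch below corrupts a token sequence by swapping sampled tokens for similar words instead of [MASK]. The `get_similar_word` callback, the 15% sampling rate, and the per-token sampling loop are illustrative assumptions; the released model follows the paper's full masking schedule, including whole-word and N-gram span selection, which is omitted here for brevity.

```python
import random

def mac_mask(tokens, get_similar_word, mask_rate=0.15):
    """Sketch of MLM-as-correction masking.

    Selected tokens are replaced with semantically similar words rather
    than [MASK]. `get_similar_word` stands in for a word2vec/synonym
    lookup and is not part of the released MacBERT codebase.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            similar = get_similar_word(tok)
            # Fall back to a random word when no similar word is found.
            corrupted.append(similar if similar is not None else random.choice(tokens))
            labels.append(tok)   # the model learns to "correct" back to the original
        else:
            corrupted.append(tok)
            labels.append(None)  # position is not predicted
    return corrupted, labels
```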
Quick Start & Requirements
The models load through the Hugging Face Transformers library with BertTokenizer.from_pretrained("MODEL_NAME") and BertModel.from_pretrained("MODEL_NAME"). Pre-trained checkpoints are published as hfl/chinese-macbert-large and hfl/chinese-macbert-base on the Hugging Face Hub. TensorFlow 1.x versions are also available for direct download.
Highlighted Details
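The loading call follows the standard Transformers pattern, shown below for the base checkpoint; the sample sentence and forward pass are illustrative additions, not part of the project's documentation.

```python
from transformers import BertTokenizer, BertModel

# Load the base MacBERT checkpoint from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = BertModel.from_pretrained("hfl/chinese-macbert-base")

# Encode a short Chinese sentence and run a forward pass (illustrative).
inputs = tokenizer("使用语言模型来预测下一个词。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```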
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats