MacBERT by ymcui

Chinese NLP pre-trained language model research paper

created 4 years ago
678 stars

Top 51.0% on sourcepulse

Project Summary

MacBERT is a pre-trained language model designed to improve Chinese Natural Language Processing (NLP) tasks by addressing the pre-training/downstream task discrepancy. It is targeted at NLP researchers and practitioners working with Chinese text. The primary benefit is enhanced performance on various NLP benchmarks.

How It Works

MacBERT introduces a novel "MLM as correction" (Mac) pre-training task. Instead of masking tokens with the artificial [MASK] token, it replaces them with semantically similar words found via word2vec-based similarity. Because the model must correct plausible word substitutions rather than fill in [MASK] placeholders that never appear in downstream data, the pre-training objective is more consistent with fine-tuning conditions. MacBERT also incorporates Whole Word Masking and N-gram masking when selecting spans to corrupt.
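A minimal sketch of this masking style is shown below. The nearest_word similarity lookup is a stand-in for the word2vec-based tool the authors use, and the masking and substitution ratios are illustrative rather than the exact training configuration.

    import random

    def mac_mask(tokens, nearest_word, mask_ratio=0.15):
        """Toy sketch of MacBERT-style 'MLM as correction' masking.

        `nearest_word` is a hypothetical similar-word lookup; ratios below
        follow the usual MLM convention and are assumptions, not the
        authors' exact settings.
        """
        corrupted = list(tokens)
        labels = [None] * len(tokens)             # only corrupted positions get labels
        for i, tok in enumerate(tokens):
            if random.random() >= mask_ratio:
                continue
            labels[i] = tok                       # model must restore the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = nearest_word(tok)      # similar word instead of [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(tokens)  # random token from the sequence
            # else: keep the original token unchanged
        return corrupted, labels

    # usage (with a hypothetical similarity lookup):
    # corrupted, labels = mac_mask(["我", "喜欢", "自然", "语言", "处理"], nearest_word=my_lookup)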

Quick Start & Requirements

  • Installation: Models can be loaded via 🤗 Transformers using BertTokenizer.from_pretrained("MODEL_NAME") and BertModel.from_pretrained("MODEL_NAME") (see the example after this list).
  • Dependencies: Python, 🤗 Transformers library.
  • Models: Available as hfl/chinese-macbert-large and hfl/chinese-macbert-base on the Hugging Face Hub. TensorFlow 1.x versions are also available for direct download.
  • Resources: MacBERT-base (102M parameters), MacBERT-large (324M parameters).
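Loading the base checkpoint follows the standard Transformers pattern described in the repository README; the example sentence below is illustrative.

    from transformers import BertTokenizer, BertModel

    # Load the base model from the Hugging Face Hub.
    tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-base")
    model = BertModel.from_pretrained("hfl/chinese-macbert-base")

    # Encode a sample Chinese sentence and run a forward pass.
    inputs = tokenizer("使用MacBERT处理中文文本。", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)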

Highlighted Details

  • Achieves state-of-the-art results on multiple Chinese NLP tasks, including CMRC 2018, DRCD, XNLI, ChnSentiCorp, LCQMC, and BQ Corpus.
  • Demonstrates significant improvements over standard BERT and other BERT variants like BERT-wwm and RoBERTa-wwm-ext.
  • Seamlessly integrates with existing BERT-based codebases due to architectural similarity.
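Because MacBERT keeps the BERT architecture, it can typically be swapped into an existing BERT fine-tuning pipeline by changing only the checkpoint name. A minimal sketch using the standard Transformers classification head follows; the task, labels, and num_labels value are placeholders.

    from transformers import BertTokenizer, BertForSequenceClassification

    # Swap a BERT checkpoint for MacBERT without touching the rest of the pipeline.
    model_name = "hfl/chinese-macbert-base"   # previously e.g. "bert-base-chinese"
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # The classification head is freshly initialized and still needs fine-tuning.
    batch = tokenizer(["这家餐厅很好吃", "服务太差了"], padding=True, return_tensors="pt")
    logits = model(**batch).logits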

Maintenance & Community

  • The project is associated with the authors of the paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" (Findings of EMNLP 2020).
  • The authors have released other related projects like Chinese-LLaMA-Alpaca and PERT.
  • Issues can be reported via GitHub Issues.

Licensing & Compatibility

  • The README does not explicitly state a license. Availability on the Hugging Face Hub does not by itself imply any particular license, so usage terms, including commercial-use compatibility, should be confirmed from the repository and model cards.

Limitations & Caveats

  • No English version of MacBERT is available.
  • The training code and pre-training corpus are not open-sourced due to licensing restrictions.
  • There are no current plans to train MacBERT on larger corpora or release the training code.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
