MacBERT by ymcui

Chinese NLP pre-trained language model research paper

created 4 years ago
678 stars

Top 51.0% on sourcepulse

Project Summary

MacBERT is a pre-trained language model designed to improve Chinese Natural Language Processing (NLP) tasks by addressing the pre-training/downstream task discrepancy. It is targeted at NLP researchers and practitioners working with Chinese text. The primary benefit is enhanced performance on various NLP benchmarks.

How It Works

MacBERT introduces a novel "MLM as correction" (Mac) pre-training task. Instead of masking tokens with the artificial [MASK] token, it replaces them with semantically similar words found via word2vec-based similarity. Because the model must correct plausible word substitutions rather than fill in [MASK] placeholders that never appear in downstream data, the pre-training objective is more consistent with fine-tuning conditions. MacBERT also incorporates Whole Word Masking and N-gram masking when selecting spans to corrupt.
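A minimal sketch of this masking style is shown below. The nearest_word similarity lookup is a stand-in for the word2vec-based tool the authors use, and the masking and substitution ratios are illustrative rather than the exact training configuration.

    import random

    def mac_mask(tokens, nearest_word, mask_ratio=0.15):
        """Toy sketch of MacBERT-style 'MLM as correction' masking.

        `nearest_word` is a hypothetical similar-word lookup; ratios below
        follow the usual MLM convention and are assumptions, not the
        authors' exact settings.
        """
        corrupted = list(tokens)
        labels = [None] * len(tokens)             # only corrupted positions get labels
        for i, tok in enumerate(tokens):
            if random.random() >= mask_ratio:
                continue
            labels[i] = tok                       # model must restore the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = nearest_word(tok)      # similar word instead of [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(tokens)  # random token from the sequence
            # else: keep the original token unchanged
        return corrupted, labels

    # usage (with a hypothetical similarity lookup):
    # corrupted, labels = mac_mask(["我", "喜欢", "自然", "语言", "处理"], nearest_word=my_lookup)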

Quick Start & Requirements

  • Installation: Models can be loaded via 🤗 Transformers using BertTokenizer.from_pretrained("MODEL_NAME") and BertModel.from_pretrained("MODEL_NAME") (see the example after this list).
  • Dependencies: Python, 🤗 Transformers library.
  • Models: Available as hfl/chinese-macbert-large and hfl/chinese-macbert-base on the Hugging Face Hub. TensorFlow 1.x versions are also available for direct download.
  • Resources: MacBERT-base (102M parameters), MacBERT-large (324M parameters).
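Loading the base checkpoint follows the standard Transformers pattern described in the repository README; the example sentence below is illustrative.

    from transformers import BertTokenizer, BertModel

    # Load the base model from the Hugging Face Hub.
    tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-base")
    model = BertModel.from_pretrained("hfl/chinese-macbert-base")

    # Encode a sample Chinese sentence and run a forward pass.
    inputs = tokenizer("使用MacBERT处理中文文本。", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)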

Highlighted Details

  • Achieves state-of-the-art results on multiple Chinese NLP tasks, including CMRC 2018, DRCD, XNLI, ChnSentiCorp, LCQMC, and BQ Corpus.
  • Demonstrates significant improvements over standard BERT and other BERT variants like BERT-wwm and RoBERTa-wwm-ext.
  • Seamlessly integrates with existing BERT-based codebases due to architectural similarity.
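Because MacBERT keeps the BERT architecture, it can typically be swapped into an existing BERT fine-tuning pipeline by changing only the checkpoint name. A minimal sketch using the standard Transformers classification head follows; the task, labels, and num_labels value are placeholders.

    from transformers import BertTokenizer, BertForSequenceClassification

    # Swap a BERT checkpoint for MacBERT without touching the rest of the pipeline.
    model_name = "hfl/chinese-macbert-base"   # previously e.g. "bert-base-chinese"
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # The classification head is freshly initialized and still needs fine-tuning.
    batch = tokenizer(["这家餐厅很好吃", "服务太差了"], padding=True, return_tensors="pt")
    logits = model(**batch).logits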

Maintenance & Community

  • The project is associated with the authors of the paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" (Findings of EMNLP 2020).
  • The authors have released other related projects like Chinese-LLaMA-Alpaca and PERT.
  • Issues can be reported via GitHub Issues.

Licensing & Compatibility

  • The README does not explicitly state a license. Availability on the Hugging Face Hub does not by itself imply any particular license, so usage terms, including commercial-use compatibility, should be confirmed from the repository and model cards.

Limitations & Caveats

  • No English version of MacBERT is available.
  • The training code and pre-training corpus are not open-sourced due to licensing restrictions.
  • There are no current plans to train MacBERT on larger corpora or release the training code.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
