Chinese-BERT-wwm by ymcui

Pre-trained language models for Chinese NLP tasks

created 6 years ago
10,031 stars

Top 5.1% on sourcepulse

View on GitHub
Project Summary

This repository provides a suite of pre-trained BERT models for Chinese natural language processing, specifically incorporating the Whole Word Masking (WWM) technique. It aims to improve Chinese NLP tasks by masking entire words rather than sub-word units, benefiting researchers and developers working with Chinese text.

How It Works

The core innovation is Whole Word Masking (WWM). BERT's WordPiece tokenization splits Chinese text into individual characters, so the original masking strategy can mask a single character of a multi-character word. With WWM, when any character (sub-word token) of a word is selected for masking, every other character belonging to that word is masked as well. Applied to Chinese with LTP for word segmentation, this yields more robust language understanding and better performance on downstream tasks.
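
The sketch below illustrates the idea under simplifying assumptions: the word segmentation is hard-coded rather than produced by LTP, and the function name, masking ratio, and example sentence are illustrative only, not the repository's actual pre-training code.

```python
import random

def whole_word_mask(words, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask whole words: if a word is chosen, mask every character in it."""
    tokens = []
    for word in words:
        if random.random() < mask_ratio:
            # All characters of the selected word are replaced with [MASK],
            # instead of masking isolated characters as plain BERT would.
            tokens.extend([mask_token] * len(word))
        else:
            tokens.extend(list(word))  # character-level tokens, as in Chinese BERT
    return tokens

# Example: a sentence pre-segmented by an LTP-style word segmenter.
segmented = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词", "的", "概率"]
print(whole_word_mask(segmented))
```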

Quick Start & Requirements

  • Installation: Models can be loaded via 🤗 Transformers (hfl/chinese-bert-wwm-ext, etc.) or PaddleHub (chinese-bert-wwm-ext, etc.); see the loading sketch after this list.
  • Dependencies: Python, 🤗 Transformers, or PaddleHub.
  • Model Files: Downloadable via Google Drive or Baidu Netdisk. Base model files are ~400MB.
  • Resources: Training was conducted on Google TPUs. Downstream task performance varies based on hyperparameters.
  • Documentation: HFL-Anthology, 🤗 Transformers, PaddleHub.
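
A minimal loading sketch with 🤗 Transformers, using the hfl/chinese-bert-wwm-ext checkpoint named above; the example sentence and the printed shape check are illustrative, and PyTorch is assumed to be installed.

```python
from transformers import BertTokenizer, BertModel

# Load the Whole-Word-Masking checkpoint published under the hfl organization.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

# Encode a sample sentence and run a forward pass.
inputs = tokenizer("使用语言模型来预测下一个词的概率。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768) for the base model
```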

Highlighted Details

  • Offers multiple model variants: BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large, RBT3, RBTL3.
  • Trained on extensive datasets (up to 5.4B tokens for EXT data).
  • Demonstrates state-of-the-art performance on various Chinese NLP benchmarks like CMRC 2018, DRCD, XNLI, and ChnSentiCorp.
  • Includes smaller parameter models (RBT3, RBTL3) for resource-constrained environments.

Maintenance & Community

  • Developed by the Joint Laboratory of Harbin Institute of Technology (HIT) and iFLYTEK Research (HFL).
  • Active development with recent releases of related models (e.g., Chinese LLaMA/Alpaca, MiniRBT).
  • Issues can be reported via GitHub Issues.

Licensing & Compatibility

  • The models are provided for technical research reference only.
  • Users may use the models within the scope of the license; the authors disclaim liability for any direct or indirect damages arising from their use.

Limitations & Caveats

  • The repository is not an official release from Google or iFlytek.
  • Reproducing exact benchmark results may be challenging due to random seeds and hardware variations.
  • The README notes that some models are "RoBERTa-like BERT" and should be handled as BERT during usage and conversion (see the sketch below).
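
A minimal sketch of that caveat, assuming the hfl/chinese-roberta-wwm-ext checkpoint on the Hugging Face Hub: the RoBERTa-wwm models are loaded with the BERT classes.

```python
from transformers import BertTokenizer, BertModel  # not RobertaTokenizer / RobertaModel

# Per the README's caveat, the RoBERTa-wwm models are "RoBERTa-like BERT":
# load them with the BERT classes during usage and conversion.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
```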

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 123 stars in the last 90 days
