Chinese-BERT-wwm by ymcui

Pre-trained language models for Chinese NLP tasks

created 6 years ago
10,031 stars

Top 5.1% on sourcepulse

View on GitHub
Project Summary

This repository provides a suite of pre-trained BERT models for Chinese natural language processing, specifically incorporating the Whole Word Masking (WWM) technique. It aims to improve Chinese NLP tasks by masking entire words rather than sub-word units, benefiting researchers and developers working with Chinese text.

How It Works

The core innovation is Whole Word Masking (WWM). BERT's WordPiece tokenization splits Chinese text into individual characters, so the original masking strategy can mask a single character of a multi-character word. With WWM, when any character (sub-word token) of a word is selected for masking, every other character belonging to that word is masked as well. Applied to Chinese with LTP for word segmentation, this yields more robust language understanding and better performance on downstream tasks.
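
The sketch below illustrates the idea under simplifying assumptions: the word segmentation is hard-coded rather than produced by LTP, and the function name, masking ratio, and example sentence are illustrative only, not the repository's actual pre-training code.

```python
import random

def whole_word_mask(words, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask whole words: if a word is chosen, mask every character in it."""
    tokens = []
    for word in words:
        if random.random() < mask_ratio:
            # All characters of the selected word are replaced with [MASK],
            # instead of masking isolated characters as plain BERT would.
            tokens.extend([mask_token] * len(word))
        else:
            tokens.extend(list(word))  # character-level tokens, as in Chinese BERT
    return tokens

# Example: a sentence pre-segmented by an LTP-style word segmenter.
segmented = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词", "的", "概率"]
print(whole_word_mask(segmented))
```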

Quick Start & Requirements

  • Installation: Models can be loaded via 🤗 Transformers (hfl/chinese-bert-wwm-ext, etc.) or PaddleHub (chinese-bert-wwm-ext, etc.); see the loading sketch after this list.
  • Dependencies: Python, 🤗 Transformers, or PaddleHub.
  • Model Files: Downloadable via Google Drive or Baidu Netdisk. Base model files are ~400MB.
  • Resources: Training was conducted on Google TPUs. Downstream task performance varies based on hyperparameters.
  • Documentation: HFL-Anthology, 🤗 Transformers, PaddleHub.
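
A minimal loading sketch with 🤗 Transformers, using the hfl/chinese-bert-wwm-ext checkpoint named above; the example sentence and the printed shape check are illustrative, and PyTorch is assumed to be installed.

```python
from transformers import BertTokenizer, BertModel

# Load the Whole-Word-Masking checkpoint published under the hfl organization.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

# Encode a sample sentence and run a forward pass.
inputs = tokenizer("使用语言模型来预测下一个词的概率。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768) for the base model
```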

Highlighted Details

  • Offers multiple model variants: BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large, RBT3, RBTL3.
  • Trained on extensive datasets (up to 5.4B tokens for EXT data).
  • Demonstrates state-of-the-art performance on various Chinese NLP benchmarks like CMRC 2018, DRCD, XNLI, and ChnSentiCorp.
  • Includes smaller parameter models (RBT3, RBTL3) for resource-constrained environments.

Maintenance & Community

  • Developed by the Joint Laboratory of Harbin Institute of Technology (HIT) and iFLYTEK Research (HFL).
  • Active development with recent releases of related models (e.g., Chinese LLaMA/Alpaca, MiniRBT).
  • Issues can be reported via GitHub Issues.

Licensing & Compatibility

  • The models are provided for technical research reference only.
  • Users may use the models within the scope of the license; the authors disclaim liability for any direct or indirect damages arising from their use.

Limitations & Caveats

  • The repository is not an official release from Google or iFlytek.
  • Reproducing exact benchmark results may be challenging due to random seeds and hardware variations.
  • The README notes that some models are "RoBERTa-like BERT" and should be handled as BERT during usage and conversion (see the sketch below).
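
A minimal sketch of that caveat, assuming the hfl/chinese-roberta-wwm-ext checkpoint on the Hugging Face Hub: the RoBERTa-wwm models are loaded with the BERT classes.

```python
from transformers import BertTokenizer, BertModel  # not RobertaTokenizer / RobertaModel

# Per the README's caveat, the RoBERTa-wwm models are "RoBERTa-like BERT":
# load them with the BERT classes during usage and conversion.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
```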

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 123 stars in the last 90 days
