bert-japanese by cl-tohoku

Pretrained BERT models for Japanese text

created 5 years ago
534 stars

Top 60.1% on sourcepulse

Project Summary

This repository provides pre-trained BERT models for Japanese, targeting NLP researchers and developers working with Japanese text. It offers base and large architectures, each available with WordPiece (subword) or character-level tokenization, so users can pick the granularity that suits their task.

How It Works

The models follow the standard BERT architecture: 12 layers for the base models and 24 for the large models, with correspondingly larger hidden sizes and more attention heads. Training was conducted on a large corpus comprising the Japanese portion of CC-100 (74.3GB) and Japanese Wikipedia (4.9GB), using a two-stage approach: the models are first trained on the extensive CC-100 data for broad language coverage and then further trained on the cleaner Wikipedia data to refine language quality. Whole word masking is applied during masked language modeling, with word boundaries determined by MeCab using the Unidic dictionary.
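
As a rough illustration of that tokenization pipeline, the sketch below loads the released v3 tokenizer from the Hugging Face hub and shows how a sentence is split first into MeCab words and then into WordPiece subwords. It assumes the transformers package plus the MeCab bindings the model cards list (typically fugashi and unidic-lite); the exact token output is illustrative only.

    # Sketch: inspect the MeCab + WordPiece tokenization used by the models.
    # Assumed install: pip install transformers fugashi unidic-lite
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")

    text = "東北大学で自然言語処理を研究しています。"
    print(tokenizer.tokenize(text))
    # Subword pieces that continue a MeCab word are prefixed with "##";
    # whole word masking masks all pieces of such a word together.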

Quick Start & Requirements

The pre-trained models are available on the Hugging Face model hub and can be loaded with the transformers library. The README also provides detailed instructions and scripts for reproducing the training process, which requires significant computational resources: training was done on Google Cloud TPUs (v3-8 instances) with TensorFlow v2.11.0, taking approximately 16 days for the base models and around 56 days for the large models on a single v3-8 TPU.
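
A minimal usage sketch, assuming the transformers library (with PyTorch) and the tokenizer dependencies noted above; the model ID follows the Hugging Face hub naming, and the predictions printed are illustrative only.

    # Sketch: masked-token prediction with the base v3 model via a fill-mask pipeline.
    # Assumed install: pip install transformers torch fugashi unidic-lite
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-v3")

    for prediction in fill_mask("東北大学で[MASK]の研究をしています。"):
        # Each prediction is a dict with the candidate token and its score.
        print(prediction["token_str"], round(prediction["score"], 3))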

Highlighted Details

  • Offers four main model variants: bert-base-japanese-v3, bert-base-japanese-char-v3, bert-large-japanese-v2, and bert-large-japanese-char-v2 (the WordPiece and character-level tokenizers are contrasted in the sketch after this list).
  • Models are trained on a combined 79.2GB Japanese corpus from CC-100 and Wikipedia.
  • Includes detailed scripts for data preprocessing, tokenizer training, and model training on TPUs.
  • Provides a script for converting TensorFlow checkpoints to Hugging Face compatible formats.
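
The difference between the WordPiece and character-level variants in the first bullet can be seen directly from their tokenizers. A hedged sketch, assuming the hub model IDs above, with example outputs that are indicative only:

    # Sketch: contrast WordPiece (subword) and character-level tokenization.
    from transformers import AutoTokenizer

    wordpiece = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")
    char_level = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v3")

    text = "自然言語処理"
    print(wordpiece.tokenize(text))   # subword units, e.g. ['自然', '言語', '処理']
    print(char_level.tokenize(text))  # single characters, e.g. ['自', '然', '言', '語', '処', '理']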

Maintenance & Community

The repository is maintained by the Tohoku NLP group. Further information and older model versions can be found by referring to specific tags (v1.0, v2.0) of the repository.

Licensing & Compatibility

The pre-trained models and code are distributed under the Apache License 2.0, which permits commercial use and modification.

Limitations & Caveats

The training scripts are tightly coupled to Google Cloud TPUs and a specific TensorFlow version (v2.11.0), making reproduction difficult without similar infrastructure. Reported performance benchmarks are based on single random seeds and should be treated as indicative rather than definitive.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
