Pretrained BERT models for Japanese text
This repository provides pre-trained BERT models for the Japanese language, targeting NLP researchers and developers working with Japanese text. It offers several variants, including base and large architectures with both WordPiece (subword) and character-level tokenization.
How It Works
The models follow the standard BERT architecture: the base models use 12 transformer layers with 768-dimensional hidden states and 12 attention heads, while the large models use 24 layers with 1024-dimensional hidden states and 16 attention heads. Training was conducted on a large corpus comprising the Japanese portion of CC-100 (74.3GB) and Japanese Wikipedia (4.9GB), using a two-stage approach: the models are first pretrained on the extensive CC-100 data for broad language coverage, and pretraining then continues on the cleaner Wikipedia data to refine language quality. Masked language modeling uses whole word masking, with word boundaries determined by MeCab tokenization with the Unidic dictionary.
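As a concrete illustration of this tokenization scheme, the sketch below shows how the released tokenizer segments text into MeCab words and then WordPiece subwords. This is a minimal sketch: the Hugging Face model ID cl-tohoku/bert-base-japanese-v3 and the fugashi/unidic-lite dependencies are assumptions drawn from the public model cards, not instructions in this README.

```python
# Minimal sketch: inspect the MeCab word segmentation + WordPiece subword split
# used by the Japanese BERT tokenizer.
# Assumed setup: pip install transformers fugashi unidic-lite
from transformers import AutoTokenizer

# Model ID assumed from the Hugging Face hub; adjust if the namespace differs.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")

text = "東北大学で自然言語処理を研究しています。"
tokens = tokenizer.tokenize(text)  # MeCab word split, then WordPiece subwords
print(tokens)

# During pretraining, whole word masking replaces every subword belonging to
# a masked MeCab word with [MASK] at the same time.
```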
Quick Start & Requirements
The pre-trained models are available on the Hugging Face model hub; to use them, you will need the transformers library. The README provides detailed instructions and scripts for reproducing the training process, which requires significant computational resources: training was run on Google Cloud TPUs (v3-8 instances) with TensorFlow v2.11.0, and takes approximately 16 days for the base models and roughly 56 days for the large models on a single v3-8 TPU.
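A minimal usage sketch, assuming the models are published under the cl-tohoku namespace on the hub and that fugashi and unidic-lite are installed for the MeCab-based tokenizer:

```python
# Minimal sketch: masked-word prediction with the base WordPiece model.
# Assumed setup: pip install transformers fugashi unidic-lite
from transformers import pipeline

# Model ID assumed from the Hugging Face hub; the char-level and large
# variants listed below can be substituted here.
fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-v3")

for prediction in fill_mask("東北大学で[MASK]の研究をしています。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```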
Highlighted Details
The released model variants are bert-base-japanese-v3, bert-base-japanese-char-v3, bert-large-japanese-v2, and bert-large-japanese-char-v2.
Maintenance & Community
The repository is maintained by the Tohoku NLP group. Further information on the older model versions can be found under the repository's v1.0 and v2.0 tags.
Licensing & Compatibility
The pre-trained models and code are distributed under the Apache License 2.0, which permits commercial use and modification.
Limitations & Caveats
The detailed training scripts are heavily reliant on Google Cloud TPUs and specific TensorFlow versions (v2.11.0), making reproduction challenging without similar infrastructure. Performance benchmarks are based on single random seeds and should be considered informative rather than definitive.