Code for training Japanese pretrained models
This repository provides the code and methodology for training Japanese language models, specifically GPT-2 and RoBERTa variants, developed by rinna Co., Ltd. It enables researchers and developers to reproduce or build upon these pre-trained models for various Japanese NLP tasks.
How It Works
The project builds on the Hugging Face Transformers library for model architectures and training. It supports training from scratch on large Japanese corpora such as Japanese Wikipedia and CC-100. The models use standard GPT-2 and RoBERTa configurations, and the repository provides guidance on data preprocessing, tokenization (using unidic), and training-parameter tuning for Japanese text.
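To make the training flow concrete, here is a minimal sketch of training a GPT-2 model from scratch on a plain-text Japanese corpus with the Transformers Trainer API. It is a generic illustration, not this repository's actual training scripts: the tokenizer (borrowed from a published rinna checkpoint), the corpus path, and the hyperparameters are placeholder assumptions, and the datasets package is assumed to be installed.

```python
# Generic sketch -- not this repository's training code. Illustrates training
# a standard GPT-2 configuration from scratch with Hugging Face Transformers.
from datasets import load_dataset  # assumes the `datasets` package is installed
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

# Placeholder: reuse a published Japanese tokenizer; a from-scratch run would
# normally train its own tokenizer on the cleaned corpus first.
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium", use_fast=False)

# A standard GPT-2 configuration sized to the tokenizer's vocabulary.
config = GPT2Config(vocab_size=len(tokenizer), n_positions=1024)
model = GPT2LMHeadModel(config)

# Plain-text corpus, one document per line (e.g., cleaned Wikipedia / CC-100 text).
# "corpus/train.txt" is a hypothetical path.
dataset = load_dataset("text", data_files={"train": "corpus/train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Causal language modeling: the collator builds labels from input_ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-ja-from-scratch",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```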
Quick Start & Requirements
pip install -r requirements.txt
Additional requirements: unidic for tokenization (python -m unidic download), the Japanese CC-100 corpus, and a Japanese Wikipedia dump.
Highlighted Details
Released checkpoints: rinna/japanese-gpt2-medium, rinna/japanese-gpt2-small, rinna/japanese-gpt2-xsmall, and rinna/japanese-roberta-base are published on Hugging Face.
rinna/japanese-roberta-base has model-specific usage requirements, such as prepending [CLS] and providing explicit position_ids; see the sketch below.
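These notes matter in practice. The sketch below (not from this repository) shows one way to run masked-token prediction with rinna/japanese-roberta-base while prepending [CLS] and passing explicit position_ids; the tokenizer options, example sentence, and masked index are illustrative assumptions, so consult the model card for the officially recommended usage.

```python
# Illustrative sketch -- not from this repository. Masked-token prediction with
# rinna/japanese-roberta-base, prepending [CLS] and passing explicit position_ids.
import torch
from transformers import AutoTokenizer, RobertaForMaskedLM

model_name = "rinna/japanese-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer.do_lower_case = True  # assumption: lowercasing as suggested for this checkpoint
model = RobertaForMaskedLM.from_pretrained(model_name)
model.eval()

# Prepend [CLS] manually; the tokenizer does not add it automatically.
text = "[CLS]4年に1度オリンピックは開かれる。"
tokens = tokenizer.tokenize(text)

# Mask one token for illustration (index chosen arbitrarily for this sentence).
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
token_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor([token_ids])

# Provide explicit position_ids starting from 0, per the usage notes above.
position_ids = torch.LongTensor([list(range(len(token_ids)))])

with torch.no_grad():
    outputs = model(input_ids=input_ids, position_ids=position_ids)
top_ids = outputs.logits[0, masked_idx].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))  # top-5 candidate tokens for the mask
```

Handling [CLS] and position_ids by hand like this is what distinguishes these checkpoints from a stock RoBERTa setup, which is why the usage notes call it out explicitly.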
Maintenance & Community
The project is maintained by rinna Co., Ltd., and users can open GitHub issues for support. The most recent noted update is from January 2022.
Licensing & Compatibility
Limitations & Caveats
Training from scratch is resource-intensive. Specific usage notes for rinna/japanese-roberta-base (e.g., the [CLS] token and position_ids) are crucial for correct inference and may differ from standard Hugging Face implementations.