japanese-pretrained-models by rinnakk

Code for training Japanese pretrained models

created 4 years ago
578 stars

Top 56.8% on sourcepulse

View on GitHub
Project Summary

This repository provides the code and methodology for training Japanese language models, specifically GPT-2 and RoBERTa variants, developed by rinna Co., Ltd. It enables researchers and developers to reproduce or build upon these pre-trained models for various Japanese NLP tasks.

How It Works

The project leverages the Hugging Face Transformers library for model architecture and training. It supports training from scratch using large Japanese corpora like Japanese Wikipedia and CC-100. Key architectural choices include standard GPT-2 and RoBERTa configurations, with specific guidance on data preprocessing, tokenization (using unidic), and training parameter tuning for optimal performance on Japanese text.
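
As a concrete illustration of the Transformers-based setup, the sketch below instantiates a GPT-2 language model from a configuration object for from-scratch training. The hyperparameters shown are illustrative placeholders, not the repository's actual configurations.

```python
# Minimal sketch: building a GPT-2 model for from-scratch training with
# Hugging Face Transformers. Hyperparameters are illustrative placeholders,
# not the values used by this repository.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32000,   # assumed size of a Japanese sentencepiece vocabulary
    n_positions=1024,   # maximum sequence length
    n_embd=512,         # hidden size
    n_layer=6,          # number of transformer layers
    n_head=8,           # number of attention heads
)
model = GPT2LMHeadModel(config)  # randomly initialized weights
print(f"Parameters: {model.num_parameters():,}")
```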

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python, unidic (for tokenization: python -m unidic download), Japanese CC-100 corpus, Japanese Wikipedia dump.
  • Training: Requires multiple GPUs (e.g., 4x V100 for GPT-2 xsmall, 8x V100 for RoBERTa base). Training times range from days to weeks depending on model size and hardware.
  • Resources: Training from scratch demands significant computational resources and large datasets.
  • Docs: pre-trained models are published on the Hugging Face Model Hub (see the loading sketch below).
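
For quick experimentation without training from scratch, the released checkpoints can be loaded straight from the Hugging Face Hub. The snippet below is a minimal text-generation sketch with rinna/japanese-gpt2-medium; it assumes the torch, transformers, and sentencepiece packages are installed, and the prompt and sampling settings are arbitrary.

```python
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

# rinna's GPT-2 checkpoints ship a sentencepiece tokenizer loaded via T5Tokenizer.
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True  # avoid casing mismatches with the vocabulary
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")
model.eval()

input_ids = tokenizer.encode("こんにちは、", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=50,     # arbitrary example settings
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```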

Highlighted Details

  • Offers pre-trained models like rinna/japanese-gpt2-medium, rinna/japanese-gpt2-small, rinna/japanese-gpt2-xsmall, and rinna/japanese-roberta-base on Hugging Face.
  • Provides detailed instructions for training GPT-2 and RoBERTa models from scratch.
  • Includes specific usage tips for rinna/japanese-roberta-base, such as prepending [CLS] and providing explicit position_ids (a usage sketch follows this list).
  • Code includes utilities for converting checkpoints to the Hugging Face format and validating them.
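
The usage tips above matter in practice: the [CLS] token must be prepended manually and position_ids must be passed explicitly, starting from 0. Below is a minimal masked-language-model sketch along those lines; the input sentence and masked position are arbitrary examples.

```python
import torch
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
model.eval()

# Prepend [CLS] manually; the tokenizer does not add it automatically.
text = "[CLS]4年に1度オリンピックは開かれる。"
tokens = tokenizer.tokenize(text)
tokens[5] = tokenizer.mask_token          # mask an arbitrary token for illustration
token_ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.LongTensor([token_ids])
# Pass explicit position_ids starting from 0, as the usage notes require.
position_ids = torch.LongTensor([list(range(len(token_ids)))])

with torch.no_grad():
    logits = model(input_ids=input_ids, position_ids=position_ids).logits
predicted_id = logits[0, 5].argmax(-1).item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))
```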

Maintenance & Community

The project is maintained by rinna Co., Ltd., with support handled through GitHub issues. The repository was last updated in January 2022.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Training from scratch is resource-intensive, requiring multiple GPUs and large corpora. The usage notes for rinna/japanese-roberta-base (prepending the [CLS] token and passing explicit position_ids) are required for correct inference and differ from the default Hugging Face workflow.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days
