japanese-pretrained-models by rinnakk

Code for training Japanese pretrained models

created 4 years ago
578 stars

Top 56.8% on sourcepulse

View on GitHub
Project Summary

This repository provides the code and methodology for training Japanese language models, specifically GPT-2 and RoBERTa variants, developed by rinna Co., Ltd. It enables researchers and developers to reproduce or build upon these pre-trained models for various Japanese NLP tasks.

How It Works

The project leverages the Hugging Face Transformers library for model architecture and training. It supports training from scratch using large Japanese corpora like Japanese Wikipedia and CC-100. Key architectural choices include standard GPT-2 and RoBERTa configurations, with specific guidance on data preprocessing, tokenization (using unidic), and training parameter tuning for optimal performance on Japanese text.
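
As a concrete illustration of the Transformers-based setup, the sketch below instantiates a GPT-2 language model from a configuration object for from-scratch training. The hyperparameters shown are illustrative placeholders, not the repository's actual configurations.

```python
# Minimal sketch: building a GPT-2 model for from-scratch training with
# Hugging Face Transformers. Hyperparameters are illustrative placeholders,
# not the values used by this repository.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32000,   # assumed size of a Japanese sentencepiece vocabulary
    n_positions=1024,   # maximum sequence length
    n_embd=512,         # hidden size
    n_layer=6,          # number of transformer layers
    n_head=8,           # number of attention heads
)
model = GPT2LMHeadModel(config)  # randomly initialized weights
print(f"Parameters: {model.num_parameters():,}")
```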

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python, unidic (for tokenization: python -m unidic download), Japanese CC-100 corpus, Japanese Wikipedia dump.
  • Training: Requires multiple GPUs (e.g., 4x V100 for GPT-2 xsmall, 8x V100 for RoBERTa base). Training times range from days to weeks depending on model size and hardware.
  • Resources: Training from scratch demands significant computational resources and large datasets.
  • Docs: pre-trained models are published on the Hugging Face Model Hub (see the loading sketch below).
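
For quick experimentation without training from scratch, the released checkpoints can be loaded straight from the Hugging Face Hub. The snippet below is a minimal text-generation sketch with rinna/japanese-gpt2-medium; it assumes the torch, transformers, and sentencepiece packages are installed, and the prompt and sampling settings are arbitrary.

```python
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

# rinna's GPT-2 checkpoints ship a sentencepiece tokenizer loaded via T5Tokenizer.
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True  # avoid casing mismatches with the vocabulary
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")
model.eval()

input_ids = tokenizer.encode("こんにちは、", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=50,     # arbitrary example settings
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```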

Highlighted Details

  • Offers pre-trained models like rinna/japanese-gpt2-medium, rinna/japanese-gpt2-small, rinna/japanese-gpt2-xsmall, and rinna/japanese-roberta-base on Hugging Face.
  • Provides detailed instructions for training GPT-2 and RoBERTa models from scratch.
  • Includes specific usage tips for rinna/japanese-roberta-base, such as prepending [CLS] and providing explicit position_ids (a usage sketch follows this list).
  • Code includes utilities for converting checkpoints to the Hugging Face format and validating them.
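
The usage tips above matter in practice: the [CLS] token must be prepended manually and position_ids must be passed explicitly, starting from 0. Below is a minimal masked-language-model sketch along those lines; the input sentence and masked position are arbitrary examples.

```python
import torch
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
model.eval()

# Prepend [CLS] manually; the tokenizer does not add it automatically.
text = "[CLS]4年に1度オリンピックは開かれる。"
tokens = tokenizer.tokenize(text)
tokens[5] = tokenizer.mask_token          # mask an arbitrary token for illustration
token_ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.LongTensor([token_ids])
# Pass explicit position_ids starting from 0, as the usage notes require.
position_ids = torch.LongTensor([list(range(len(token_ids)))])

with torch.no_grad():
    logits = model(input_ids=input_ids, position_ids=position_ids).logits
predicted_id = logits[0, 5].argmax(-1).item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))
```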

Maintenance & Community

The project is maintained by rinna Co., Ltd., with support handled through GitHub issues. The repository was last updated in January 2022.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Training from scratch is resource-intensive, requiring multiple GPUs and large corpora. The usage notes for rinna/japanese-roberta-base (prepending the [CLS] token and passing explicit position_ids) are required for correct inference and differ from the default Hugging Face workflow.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days
