gpt2-ml by imcaspar

GPT-2 for multiple languages, including pretrained models

created 5 years ago
1,712 stars

Top 25.4% on sourcepulse

View on GitHub
Project Summary

This repository provides a GPT-2 implementation optimized for multilingual support, specifically featuring a 1.5 billion parameter Chinese pretrained model. It is designed for researchers and developers working with large-scale language models for Chinese text generation and analysis.

How It Works

The project adapts Grover's training scripts to GPT-2 and swaps in a ported BERT tokenizer, which makes the pipeline compatible with Chinese and other multilingual corpora. It leverages Cloud TPUs for training at scale, enabling large models such as the 1.5B-parameter Chinese version.
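
The tokenizer is the key multilingual piece: rather than GPT-2's byte-level BPE, the repo ships a ported BERT WordPiece tokenizer. Below is a minimal sketch of the same idea, using Hugging Face's BertTokenizer as a stand-in (an assumption for illustration; the repository bundles its own ported tokenization code):

```python
# Illustrative sketch of WordPiece tokenization for Chinese text, the approach
# the repo's ported BERT tokenizer takes. Hugging Face's BertTokenizer is used
# here as a stand-in; the repository bundles its own ported tokenization code.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

text = "今天天气不错"  # "The weather is nice today"
tokens = tokenizer.tokenize(text)                 # Chinese splits per character
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)      # ['今', '天', '天', '气', '不', '错']
print(token_ids)   # vocabulary indices fed to the model
```

Because WordPiece covers Chinese characters directly, no byte-level merge rules are needed to handle the corpus.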

Quick Start & Requirements

  • Install/Run: A Colab demo is available for quick experimentation (a download sketch follows this list).
  • Prerequisites: Google Colab for the demo; Cloud TPUs for training from scratch.
  • Resources: The pretrained models were trained on ~15GB and ~30GB Chinese corpora; the checkpoints themselves are multi-gigabyte downloads.
  • Links: Colab Notebook
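
A hedged sketch of the checkpoint-download step the Colab demo performs, assuming the checkpoints are hosted on Google Drive as linked in the README; the file ID, archive name, and extraction path below are placeholders, not real values:

```python
# Hedged sketch of the checkpoint-download step a Colab demo would perform.
# FILE_ID, the archive name, and the extraction path are placeholders --
# substitute the Google Drive link given in the repository's README.
import zipfile

import gdown  # pip install gdown

FILE_ID = "<drive-file-id-from-README>"       # placeholder, not a real ID
archive = "gpt2_ml_1.5b_chinese.zip"          # assumed archive name

gdown.download(f"https://drive.google.com/uc?id={FILE_ID}", archive, quiet=False)

with zipfile.ZipFile(archive) as zf:
    zf.extractall("models/gpt2-ml-1.5b")      # assumed checkpoint directory
    print("Extracted:", zf.namelist()[:5])
```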

Highlighted Details

  • Features a 1.5 billion parameter GPT-2 model pretrained on Chinese corpora (~15GB and ~30GB corpus versions); a parameter-count check is sketched after this list.
  • Training utilized Cloud TPU Pod v3-256 for 220,000 steps.
  • Includes simplified training scripts based on Grover.
  • Compatible with multilingual corpora via a ported BERT tokenizer.
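
To sanity-check the 1.5-billion-parameter figure against a downloaded checkpoint, one can sum the variable shapes stored in the TensorFlow checkpoint. The checkpoint prefix below is an assumed path, not one documented by the project:

```python
# Hedged sketch: sum variable shapes in a TensorFlow checkpoint to verify the
# ~1.5B parameter count. The checkpoint prefix is an assumed path, not one
# documented by the project. Optimizer slot variables, if present in the
# checkpoint, will inflate the total.
import numpy as np
import tensorflow as tf

ckpt_prefix = "models/gpt2-ml-1.5b/model.ckpt"    # assumed checkpoint prefix

reader = tf.train.load_checkpoint(ckpt_prefix)
shape_map = reader.get_variable_to_shape_map()

total_params = sum(int(np.prod(shape)) for shape in shape_map.values())
print(f"{len(shape_map)} variables, ~{total_params / 1e9:.2f}B parameters")
```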

Maintenance & Community

  • Developed by Zhibo Zhang.
  • Research supported by Google's TensorFlow Research Cloud (TFRC).

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README.
  • The project is intended for academic research purposes.

Limitations & Caveats

The project is intended for academic research, and the authors explicitly state that they do not provide any conclusive remarks. Because no license is stated, the terms for commercial use or closed-source linking are unclear.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
