gpt2-ml by imcaspar

GPT-2 for multiple languages, including pretrained models

created 5 years ago
1,712 stars

Top 25.4% on sourcepulse

View on GitHub
Project Summary

This repository provides a GPT-2 implementation optimized for multilingual support, specifically featuring a 1.5 billion parameter Chinese pretrained model. It is designed for researchers and developers working with large-scale language models for Chinese text generation and analysis.

How It Works

The project adapts Grover's training scripts to GPT-2 and swaps in a ported BERT tokenizer, which makes the pipeline compatible with Chinese and other multilingual corpora. It leverages Cloud TPUs for training at scale, enabling large models such as the 1.5B-parameter Chinese version.
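
The tokenizer is the key multilingual piece: rather than GPT-2's byte-level BPE, the repo ships a ported BERT WordPiece tokenizer. Below is a minimal sketch of the same idea, using Hugging Face's BertTokenizer as a stand-in (an assumption for illustration; the repository bundles its own ported tokenization code):

```python
# Illustrative sketch of WordPiece tokenization for Chinese text, the approach
# the repo's ported BERT tokenizer takes. Hugging Face's BertTokenizer is used
# here as a stand-in; the repository bundles its own ported tokenization code.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

text = "今天天气不错"  # "The weather is nice today"
tokens = tokenizer.tokenize(text)                 # Chinese splits per character
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)      # ['今', '天', '天', '气', '不', '错']
print(token_ids)   # vocabulary indices fed to the model
```

Because WordPiece covers Chinese characters directly, no byte-level merge rules are needed to handle the corpus.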

Quick Start & Requirements

  • Install/Run: A Colab demo is available for quick experimentation (a download sketch follows this list).
  • Prerequisites: Google Colab for the demo; Cloud TPUs for training from scratch.
  • Resources: The pretrained models were trained on ~15GB and ~30GB Chinese corpora; the checkpoints themselves are multi-gigabyte downloads.
  • Links: Colab Notebook
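
A hedged sketch of the checkpoint-download step the Colab demo performs, assuming the checkpoints are hosted on Google Drive as linked in the README; the file ID, archive name, and extraction path below are placeholders, not real values:

```python
# Hedged sketch of the checkpoint-download step a Colab demo would perform.
# FILE_ID, the archive name, and the extraction path are placeholders --
# substitute the Google Drive link given in the repository's README.
import zipfile

import gdown  # pip install gdown

FILE_ID = "<drive-file-id-from-README>"       # placeholder, not a real ID
archive = "gpt2_ml_1.5b_chinese.zip"          # assumed archive name

gdown.download(f"https://drive.google.com/uc?id={FILE_ID}", archive, quiet=False)

with zipfile.ZipFile(archive) as zf:
    zf.extractall("models/gpt2-ml-1.5b")      # assumed checkpoint directory
    print("Extracted:", zf.namelist()[:5])
```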

Highlighted Details

  • Features a 1.5 billion parameter GPT-2 model pretrained on Chinese corpora (~15GB and ~30GB corpus versions); a parameter-count check is sketched after this list.
  • Training utilized Cloud TPU Pod v3-256 for 220,000 steps.
  • Includes simplified training scripts based on Grover.
  • Compatible with multilingual corpora via a ported BERT tokenizer.
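
To sanity-check the 1.5-billion-parameter figure against a downloaded checkpoint, one can sum the variable shapes stored in the TensorFlow checkpoint. The checkpoint prefix below is an assumed path, not one documented by the project:

```python
# Hedged sketch: sum variable shapes in a TensorFlow checkpoint to verify the
# ~1.5B parameter count. The checkpoint prefix is an assumed path, not one
# documented by the project. Optimizer slot variables, if present in the
# checkpoint, will inflate the total.
import numpy as np
import tensorflow as tf

ckpt_prefix = "models/gpt2-ml-1.5b/model.ckpt"    # assumed checkpoint prefix

reader = tf.train.load_checkpoint(ckpt_prefix)
shape_map = reader.get_variable_to_shape_map()

total_params = sum(int(np.prod(shape)) for shape in shape_map.values())
print(f"{len(shape_map)} variables, ~{total_params / 1e9:.2f}B parameters")
```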

Maintenance & Community

  • Developed by Zhibo Zhang.
  • Research supported by Google's TensorFlow Research Cloud (TFRC).

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README.
  • The project is intended for academic research purposes.

Limitations & Caveats

The project is intended for academic research, and the authors explicitly state that they do not provide any conclusive remarks. Because no license is stated, the terms for commercial use or closed-source linking are unclear.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
