GPT2-Chinese by Morizeyao

GPT2 training code for Chinese language models

created 6 years ago
7,585 stars

Top 7.0% on sourcepulse

Project Summary

This repository provides tools and pre-trained models for training and generating text with GPT-2 specifically for the Chinese language. It targets researchers and developers interested in Chinese NLP tasks like poetry generation, news writing, and novel creation, offering flexibility in tokenization and training corpus size.

How It Works

The project leverages the HuggingFace Transformers library as its foundation. It supports training GPT-2 models using character-level, word-level (via BERT tokenizer), or Byte Pair Encoding (BPE) tokenization. This flexibility allows for adaptation to different Chinese language processing needs and corpus characteristics.
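
To show how the Transformers-based pieces fit together, below is a minimal generation sketch. It assumes a GPT2-Chinese checkpoint hosted on the Huggingface Model Hub and a BERT-style tokenizer; the model ID, prompt, and sampling parameters are illustrative placeholders rather than anything prescribed by this repository.

    # Minimal generation sketch. Assumption: "uer/gpt2-chinese-cluecorpussmall" is used
    # here only as an illustrative model ID; substitute whichever GPT2-Chinese checkpoint
    # from the Huggingface Model Hub you actually intend to use.
    from transformers import BertTokenizerFast, GPT2LMHeadModel, TextGenerationPipeline

    tokenizer = BertTokenizerFast.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

    # The pipeline wraps tokenization, sampling, and decoding in a single call.
    generator = TextGenerationPipeline(model, tokenizer)
    print(generator("这是很久以前的事情了", max_length=100, do_sample=True, top_k=10))

The same loading pattern applies to the task-specific checkpoints (lyrics, poetry, couplets) mentioned below.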

Quick Start & Requirements

  • Install: pip install -r requirements.txt (specific requirements not detailed in README)
  • Prerequisites: PyTorch, plus NVIDIA Apex if FP16 training is desired. The BERT tokenizer needs a pre-built vocabulary or a custom one generated with make_vocab.py.
  • Data: Place the training corpus in a data folder as train.json (see the data-preparation sketch after this list).
  • Links: Huggingface Model Hub for pre-trained models.
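
As referenced in the Data bullet above, here is a minimal data-preparation sketch. The train.json layout (a JSON list in which each element is the full text of one training document) follows the project's documented format, but the sample contents below are illustrative.

    import json
    import os

    # Each list element is the raw text of one training document (illustrative samples).
    documents = [
        "第一篇训练文章的正文。",
        "第二篇训练文章的正文。",
    ]

    os.makedirs("data", exist_ok=True)
    with open("data/train.json", "w", encoding="utf-8") as f:
        json.dump(documents, f, ensure_ascii=False)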

Highlighted Details

  • Offers multiple pre-trained models for specific tasks: general Chinese, lyrics, ancient Chinese poetry, couplets, and classical Chinese.
  • Supports FP16 and gradient accumulation for potentially faster, more memory-efficient training, though FP16 convergence is noted as experimental (the general pattern is sketched after this list).
  • Includes scripts for training (train.py, train_single.py), generation (generate.py, generate_texts.py), and evaluation (eval.py).
  • Provides sample generation outputs for various tasks and datasets.
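
The FP16 and gradient-accumulation support mentioned above follows a standard training-loop pattern. The repository's own FP16 path goes through NVIDIA Apex; the sketch below instead uses PyTorch's built-in AMP with a toy model, purely to illustrate the idea, and is not the project's actual training code.

    import torch
    from torch import nn

    # Hypothetical toy model and data, just to make the loop runnable (requires a CUDA device).
    model = nn.Linear(16, 2).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    data = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

    accumulation_steps = 4                       # effective batch size = 8 * 4
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad()

    for step, (x, y) in enumerate(data):
        x, y = x.cuda(), y.cuda()
        with torch.cuda.amp.autocast():          # forward pass in mixed precision
            loss = loss_fn(model(x), y) / accumulation_steps
        scaler.scale(loss).backward()            # accumulate scaled gradients
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)               # unscale gradients, then optimizer step
            scaler.update()
            optimizer.zero_grad()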

Maintenance & Community

The author has stopped actively maintaining the project, describing it as a personal learning exercise. Questions and discussion are welcome via GitHub Issues, or by email to join the discussion group.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license.

Limitations & Caveats

The project is no longer actively maintained. FP16 training may not converge. Using the BPE tokenizer requires custom vocabulary creation. The README mentions potential issues with the TF2.0 version (Decoders-Chinese-TF2.0) and notes that GPT2-ML is unrelated.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 39 stars in the last 90 days
