GPT2-Chinese by Morizeyao

GPT2 training code for Chinese language models

created 6 years ago
7,585 stars

Top 7.0% on sourcepulse

Project Summary

This repository provides tools and pre-trained models for training and generating text with GPT-2 specifically for the Chinese language. It targets researchers and developers interested in Chinese NLP tasks like poetry generation, news writing, and novel creation, offering flexibility in tokenization and training corpus size.

How It Works

The project leverages the HuggingFace Transformers library as its foundation. It supports training GPT-2 models using character-level, word-level (via BERT tokenizer), or Byte Pair Encoding (BPE) tokenization. This flexibility allows for adaptation to different Chinese language processing needs and corpus characteristics.
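
To show how the Transformers-based pieces fit together, below is a minimal generation sketch. It assumes a GPT2-Chinese checkpoint hosted on the Huggingface Model Hub and a BERT-style tokenizer; the model ID, prompt, and sampling parameters are illustrative placeholders rather than anything prescribed by this repository.

    # Minimal generation sketch. Assumption: "uer/gpt2-chinese-cluecorpussmall" is used
    # here only as an illustrative model ID; substitute whichever GPT2-Chinese checkpoint
    # from the Huggingface Model Hub you actually intend to use.
    from transformers import BertTokenizerFast, GPT2LMHeadModel, TextGenerationPipeline

    tokenizer = BertTokenizerFast.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

    # The pipeline wraps tokenization, sampling, and decoding in a single call.
    generator = TextGenerationPipeline(model, tokenizer)
    print(generator("这是很久以前的事情了", max_length=100, do_sample=True, top_k=10))

The same loading pattern applies to the task-specific checkpoints (lyrics, poetry, couplets) mentioned below.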

Quick Start & Requirements

  • Install: pip install -r requirements.txt (specific requirements not detailed in README)
  • Prerequisites: PyTorch, plus NVIDIA Apex if FP16 training is desired. The BERT tokenizer needs a pre-built vocabulary or a custom one generated with make_vocab.py.
  • Data: Place the training corpus in a data folder as train.json (see the data-preparation sketch after this list).
  • Links: Huggingface Model Hub for pre-trained models.
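
As referenced in the Data bullet above, here is a minimal data-preparation sketch. The train.json layout (a JSON list in which each element is the full text of one training document) follows the project's documented format, but the sample contents below are illustrative.

    import json
    import os

    # Each list element is the raw text of one training document (illustrative samples).
    documents = [
        "第一篇训练文章的正文。",
        "第二篇训练文章的正文。",
    ]

    os.makedirs("data", exist_ok=True)
    with open("data/train.json", "w", encoding="utf-8") as f:
        json.dump(documents, f, ensure_ascii=False)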

Highlighted Details

  • Offers multiple pre-trained models for specific tasks: general Chinese, lyrics, ancient Chinese poetry, couplets, and classical Chinese.
  • Supports FP16 and gradient accumulation for potentially faster, more memory-efficient training, though FP16 convergence is noted as experimental (the general pattern is sketched after this list).
  • Includes scripts for training (train.py, train_single.py), generation (generate.py, generate_texts.py), and evaluation (eval.py).
  • Provides sample generation outputs for various tasks and datasets.
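
The FP16 and gradient-accumulation support mentioned above follows a standard training-loop pattern. The repository's own FP16 path goes through NVIDIA Apex; the sketch below instead uses PyTorch's built-in AMP with a toy model, purely to illustrate the idea, and is not the project's actual training code.

    import torch
    from torch import nn

    # Hypothetical toy model and data, just to make the loop runnable (requires a CUDA device).
    model = nn.Linear(16, 2).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    data = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

    accumulation_steps = 4                       # effective batch size = 8 * 4
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad()

    for step, (x, y) in enumerate(data):
        x, y = x.cuda(), y.cuda()
        with torch.cuda.amp.autocast():          # forward pass in mixed precision
            loss = loss_fn(model(x), y) / accumulation_steps
        scaler.scale(loss).backward()            # accumulate scaled gradients
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)               # unscale gradients, then optimizer step
            scaler.update()
            optimizer.zero_grad()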

Maintenance & Community

The author has stopped actively maintaining the project, describing it as a personal learning exercise. Questions and discussion are welcome via GitHub Issues, or by email to join the discussion group.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license.

Limitations & Caveats

The project is no longer actively maintained. FP16 training may not converge. Using the BPE tokenizer requires custom vocabulary creation. The README mentions potential issues with the TF2.0 version (Decoders-Chinese-TF2.0) and notes that GPT2-ML is unrelated.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 39 stars in the last 90 days
