GPT2 training code for Chinese language models
This repository provides tools and pre-trained models for training and generating text with GPT-2 specifically for the Chinese language. It targets researchers and developers interested in Chinese NLP tasks like poetry generation, news writing, and novel creation, offering flexibility in tokenization and training corpus size.
How It Works
The project is built on the HuggingFace Transformers library. It supports training GPT-2 models with character-level, word-level (via the BERT tokenizer), or Byte Pair Encoding (BPE) tokenization, allowing it to adapt to different Chinese-language processing needs and corpus characteristics.
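As an illustration of the character-level option, the sketch below tokenizes Chinese text with HuggingFace's BertTokenizer; the model name and sample sentence are placeholders, not values taken from this repository.

```python
from transformers import BertTokenizer

# Illustrative only: "bert-base-chinese" is an assumed vocabulary choice,
# not one mandated by this repository.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

text = "白日依山尽，黄河入海流。"
tokens = tokenizer.tokenize(text)              # roughly one token per Chinese character
ids = tokenizer.convert_tokens_to_ids(tokens)  # integer ids fed to the GPT-2 model
print(tokens)
print(ids)
```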
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt (the specific requirements are not detailed in the README). A vocabulary can be built with make_vocab.py (required when using the BPE tokenizer). Place your training corpus in the data folder as train.json.
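The exact corpus format should be checked against the repository's sample train.json; as a minimal sketch, assuming the corpus is a JSON list in which each element is the full text of one article, preparation could look like this:

```python
import json
from pathlib import Path

# Assumption: train.json is a JSON list of article strings. Verify against
# the sample train.json shipped with the repository before training.
articles = [
    "这是第一篇训练文章的全文。",
    "这是第二篇训练文章的全文。",
]

Path("data").mkdir(exist_ok=True)
with open("data/train.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False)
```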
Highlighted Details
The repository includes scripts for training (train.py, train_single.py), generation (generate.py, generate_texts.py), and evaluation (eval.py).
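The repository's own generation entry points are generate.py and generate_texts.py. As a hedged alternative, a checkpoint saved in HuggingFace format can be sampled directly through the Transformers API; the checkpoint path, prompt, and sampling parameters below are assumptions, not defaults from the project.

```python
import torch
from transformers import BertTokenizer, GPT2LMHeadModel

# "model/final_model" is a placeholder for a trained checkpoint directory
# saved in HuggingFace format (config, weights, and vocabulary).
tokenizer = BertTokenizer.from_pretrained("model/final_model")
model = GPT2LMHeadModel.from_pretrained("model/final_model")
model.eval()

input_ids = tokenizer.encode("今天天气", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=100,
        do_sample=True,          # sample instead of greedy decoding
        top_k=40,
        top_p=0.95,
        repetition_penalty=1.2,  # penalize already-generated tokens
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```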
Maintenance & Community
The project is no longer under active maintenance; the author describes it as a personal learning project. Questions and discussion are welcome via GitHub Issues, or by email for group communication.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license.
Limitations & Caveats
The project is no longer actively maintained. FP16 training may not converge. Using the BPE tokenizer requires building a custom vocabulary. The README mentions potential issues with the TF2.0 version (Decoders-Chinese-TF2.0) and notes that GPT2-ML is an unrelated project.
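Because the BPE path requires building your own vocabulary (the repository provides make_vocab.py for this), the sketch below shows one way to train a byte-level BPE vocabulary with the HuggingFace tokenizers library; it is not the project's actual script, and the corpus path, vocabulary size, and special tokens are assumptions.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Illustrative stand-in for make_vocab.py, not the project's implementation.
# "data/train_raw.txt" is a placeholder plain-text corpus file.
os.makedirs("vocab_bpe", exist_ok=True)

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/train_raw.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("vocab_bpe")  # writes vocab.json and merges.txt
```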