Text generation code for Chinese pre-trained language model (CPM-LM)
This repository provides text generation code for the CPM-LM (2.6B) Chinese language model, enabling local testing and research into zero-shot/few-shot learning scenarios for Chinese NLP. It is primarily aimed at researchers and developers working with large-scale Chinese language models.
How It Works
The project is based on Megatron-LM and shares a similar architecture with GPT-2. It utilizes SentencePiece for BPE tokenization, with custom handling for spaces and newlines by replacing them with specific Unicode characters before tokenization and restoring them during generation. The model supports model parallelism, with a default of 2 GPUs, and includes scripts for zero-shot classification tasks.
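The space/newline substitution described above can be sketched as follows. The specific placeholder code points used here are assumptions for illustration and are not necessarily the ones the CPM tokenizer actually uses:

```python
# Sketch of CPM-style whitespace handling around a subword tokenizer.
# Spaces and newlines are mapped to rare placeholder characters before
# SentencePiece tokenization and mapped back after generation.
# The two code points below are hypothetical stand-ins.

SPACE_PLACEHOLDER = "\u2582"    # assumed placeholder for a literal space
NEWLINE_PLACEHOLDER = "\u2583"  # assumed placeholder for a newline

def encode_whitespace(text: str) -> str:
    """Replace spaces/newlines before feeding text to the tokenizer."""
    return text.replace(" ", SPACE_PLACEHOLDER).replace("\n", NEWLINE_PLACEHOLDER)

def decode_whitespace(text: str) -> str:
    """Restore spaces/newlines in model-generated text."""
    return text.replace(SPACE_PLACEHOLDER, " ").replace(NEWLINE_PLACEHOLDER, "\n")

sample = "你好 世界\n第二行"
assert decode_whitespace(encode_whitespace(sample)) == sample
```

The round trip is lossless as long as the placeholder characters never occur in real input, which is why rarely used Unicode code points are chosen.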
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt, and install NVIDIA APEX for fp16 support. Alternatively, use the provided Docker image: docker pull dmye/cpm:v0
Run bash scripts/generate_text.sh /path/to/CPM (interactive) or bash scripts/generate_text.sh /path/to/CPM example.txt (file input).
Highlighted Details
The change_mp.py script converts a downloaded checkpoint to a different model-parallel size, e.g. python change_mp.py /path/to/CPM 1 to merge the default two-way partition for single-GPU setups.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The generation script requires two GPUs. While BMInf is suggested for single-GPU inference, its integration is not detailed within this README. The project is primarily focused on text generation and zero-shot classification, with fine-tuning code listed as a future TODO.
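For reference, the zero-shot classification mentioned above generally works by appending each candidate label to a prompt and picking the one the language model scores highest. A minimal sketch of that selection logic, with a hypothetical stand-in scorer in place of a real CPM forward pass:

```python
# Zero-shot classification by language-model scoring: the label whose
# verbalized continuation gets the highest score wins. score() is a
# hypothetical stand-in; a real implementation would sum token
# log-probabilities from a CPM forward pass.

def score(text: str) -> float:
    # Stand-in scorer for illustration only: prefers shorter strings.
    return -float(len(text))

def zero_shot_classify(prompt: str, labels: list[str]) -> str:
    """Return the label whose continuation the scorer ranks highest."""
    return max(labels, key=lambda label: score(prompt + label))
```

With the toy scorer this degenerates to picking the shortest label; the point is the argmax-over-continuations structure, which is unchanged when a real model likelihood replaces it.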