CPM-1-Generate by TsinghuaAI

Text generation code for Chinese pre-trained language model (CPM-LM)

created 4 years ago
1,583 stars

Top 26.9% on sourcepulse

View on GitHub
Project Summary

This repository provides text generation code for the CPM-LM (2.6B) Chinese language model, enabling local testing and research into zero-shot/few-shot learning scenarios for Chinese NLP. It is primarily aimed at researchers and developers working with large-scale Chinese language models.

How It Works

The project is built on Megatron-LM and uses a GPT-2-style architecture. Tokenization relies on SentencePiece BPE, with custom whitespace handling: spaces and newlines are replaced with placeholder Unicode characters before tokenization and restored during generation. The model supports model parallelism (2 GPUs by default) and ships with scripts for zero-shot classification tasks.
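
The README does not spell out the placeholder characters; below is a minimal sketch of the round trip, assuming the U+2582/U+2583 convention used in common CPM implementations (the specific codepoints are an assumption):

    import sentencepiece as spm

    SPACE_TOKEN = "\u2582"    # placeholder for an ASCII space (assumed codepoint)
    NEWLINE_TOKEN = "\u2583"  # placeholder for a newline (assumed codepoint)

    def encode(sp: spm.SentencePieceProcessor, text: str) -> list:
        # Swap whitespace for placeholders so BPE treats it as ordinary text.
        return sp.encode(text.replace(" ", SPACE_TOKEN).replace("\n", NEWLINE_TOKEN))

    def decode(sp: spm.SentencePieceProcessor, ids: list) -> str:
        # Undo the substitution after generation to recover normal whitespace.
        return sp.decode(ids).replace(SPACE_TOKEN, " ").replace(NEWLINE_TOKEN, "\n")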

Quick Start & Requirements

  • Installation: pip install -r requirements.txt and install APEX for fp16 support. Alternatively, use the provided Docker image: docker pull dmye/cpm:v0.
  • Prerequisites: PyTorch, APEX (with CUDA support), two GPUs with ~7GB VRAM each for generation.
  • Running Generation: bash scripts/generate_text.sh /path/to/CPM (interactive) or bash scripts/generate_text.sh /path/to/CPM example.txt (file input).
  • Model Download: Requires downloading the model checkpoints and arranging them in a specific directory structure; checksums are provided for verification (see the sketch after this list).
  • Documentation: Model Download, Technical Report.
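
Since the README publishes checksums for the downloaded files, a generic verification helper can be used. This is a hedged sketch: the hash algorithm (MD5) and the example path are assumptions — use whatever the release page specifies:

    import hashlib

    def file_checksum(path, algo="md5", chunk_size=1 << 20):
        # Stream the file in chunks so multi-GB checkpoints need not fit in memory.
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare against the published value, e.g. (illustrative path):
    # assert file_checksum("CPM/80000/mp_rank_00_model_states.pt") == "<published checksum>"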

Highlighted Details

  • Supports zero-shot classification tasks (OCNLI, TNEWS, IFLYTEK) with provided scripts; a schematic of the likelihood-scoring approach follows this list.
  • Offers a distilled version, CPM-Distill (109M parameters), and a third-party implementation, CPM-Generate-distill.
  • Model parallelism can be adjusted using change_mp.py.
  • For efficient single-GPU inference (GTX 1060 or better), BMInf is recommended.
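
The README does not show the scoring logic, but zero-shot classification with a causal LM typically ranks candidate labels by the log-likelihood the model assigns to each label as a continuation of the input. A minimal sketch, assuming a Hugging-Face-style model/tokenizer interface (an assumption — the actual scripts drive the Megatron-LM checkpoint directly):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def score_label(model, tokenizer, text, label):
        # Total log-likelihood of the label tokens appended to the text;
        # the label the LM finds most probable as a continuation wins.
        ids = torch.tensor([tokenizer.encode(text + label)])
        label_len = len(tokenizer.encode(label))
        logits = model(ids).logits            # HF-style output attribute; an assumption
        log_probs = F.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        token_ll = log_probs[torch.arange(targets.size(0)), targets]
        return token_ll[-label_len:].sum().item()

    def classify(model, tokenizer, text, labels):
        return max(labels, key=lambda label: score_label(model, tokenizer, text, label))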

Maintenance & Community

  • The project is associated with Tsinghua University.
  • A citation is provided for the CPM-1 model.

Licensing & Compatibility

  • The README does not explicitly state a license. Given the association with Tsinghua University and the research-oriented nature of the code, it appears intended for research use; commercial use would require clarification from the authors.

Limitations & Caveats

The generation script defaults to two GPUs; change_mp.py can repartition the checkpoint for other configurations. While BMInf is suggested for single-GPU inference, its integration is not detailed in this README. The project focuses on text generation and zero-shot classification, with fine-tuning code listed as a future TODO.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
