CPM-1-Generate by TsinghuaAI

Text generation code for Chinese pre-trained language model (CPM-LM)

created 4 years ago
1,583 stars

Top 26.9% on sourcepulse

View on GitHub
Project Summary

This repository provides text generation code for the CPM-LM (2.6B) Chinese language model, enabling local testing and research into zero-shot/few-shot learning scenarios for Chinese NLP. It is primarily aimed at researchers and developers working with large-scale Chinese language models.

How It Works

The project is built on Megatron-LM and uses a GPT-2-style architecture. Tokenization relies on SentencePiece BPE, with custom whitespace handling: spaces and newlines are replaced with placeholder Unicode characters before tokenization and restored during generation. The model supports model parallelism (2 GPUs by default) and ships with scripts for zero-shot classification tasks.
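
The README does not spell out the placeholder characters; below is a minimal sketch of the round trip, assuming the U+2582/U+2583 convention used in common CPM implementations (the specific codepoints are an assumption):

    import sentencepiece as spm

    SPACE_TOKEN = "\u2582"    # placeholder for an ASCII space (assumed codepoint)
    NEWLINE_TOKEN = "\u2583"  # placeholder for a newline (assumed codepoint)

    def encode(sp: spm.SentencePieceProcessor, text: str) -> list:
        # Swap whitespace for placeholders so BPE treats it as ordinary text.
        return sp.encode(text.replace(" ", SPACE_TOKEN).replace("\n", NEWLINE_TOKEN))

    def decode(sp: spm.SentencePieceProcessor, ids: list) -> str:
        # Undo the substitution after generation to recover normal whitespace.
        return sp.decode(ids).replace(SPACE_TOKEN, " ").replace(NEWLINE_TOKEN, "\n")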

Quick Start & Requirements

  • Installation: pip install -r requirements.txt and install APEX for fp16 support. Alternatively, use the provided Docker image: docker pull dmye/cpm:v0.
  • Prerequisites: PyTorch, APEX (with CUDA support), two GPUs with ~7GB VRAM each for generation.
  • Running Generation: bash scripts/generate_text.sh /path/to/CPM (interactive) or bash scripts/generate_text.sh /path/to/CPM example.txt (file input).
  • Model Download: Requires downloading the model checkpoints and arranging them in a specific directory structure; checksums are provided for verification (see the sketch after this list).
  • Documentation: Model Download, Technical Report.
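
Since the README publishes checksums for the downloaded files, a generic verification helper can be used. This is a hedged sketch: the hash algorithm (MD5) and the example path are assumptions — use whatever the release page specifies:

    import hashlib

    def file_checksum(path, algo="md5", chunk_size=1 << 20):
        # Stream the file in chunks so multi-GB checkpoints need not fit in memory.
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare against the published value, e.g. (illustrative path):
    # assert file_checksum("CPM/80000/mp_rank_00_model_states.pt") == "<published checksum>"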

Highlighted Details

  • Supports zero-shot classification tasks (OCNLI, TNEWS, IFLYTEK) with provided scripts; a schematic of the likelihood-scoring approach follows this list.
  • Offers a distilled version, CPM-Distill (109M parameters), and a third-party implementation, CPM-Generate-distill.
  • Model parallelism can be adjusted using change_mp.py.
  • For efficient single-GPU inference (GTX 1060 or better), BMInf is recommended.
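
The README does not show the scoring logic, but zero-shot classification with a causal LM typically ranks candidate labels by the log-likelihood the model assigns to each label as a continuation of the input. A minimal sketch, assuming a Hugging-Face-style model/tokenizer interface (an assumption — the actual scripts drive the Megatron-LM checkpoint directly):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def score_label(model, tokenizer, text, label):
        # Total log-likelihood of the label tokens appended to the text;
        # the label the LM finds most probable as a continuation wins.
        ids = torch.tensor([tokenizer.encode(text + label)])
        label_len = len(tokenizer.encode(label))
        logits = model(ids).logits            # HF-style output attribute; an assumption
        log_probs = F.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        token_ll = log_probs[torch.arange(targets.size(0)), targets]
        return token_ll[-label_len:].sum().item()

    def classify(model, tokenizer, text, labels):
        return max(labels, key=lambda label: score_label(model, tokenizer, text, label))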

Maintenance & Community

  • The project is associated with Tsinghua University.
  • A citation is provided for the CPM-1 model.

Licensing & Compatibility

  • The README does not explicitly state a license. Given the association with Tsinghua University and the research-oriented nature of the code, it appears intended for research use; commercial use would require clarification from the authors.

Limitations & Caveats

The generation script defaults to two GPUs; change_mp.py can repartition the checkpoint for other configurations. While BMInf is suggested for single-GPU inference, its integration is not detailed in this README. The project focuses on text generation and zero-shot classification, with fine-tuning code listed as a future TODO.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
