Chinese GPT model for short-text conversation, plus dataset
This project provides a large-scale Chinese conversation dataset (LCCC) and pre-trained GPT models for Chinese dialogue generation. It's targeted at researchers and developers working on Chinese NLP and conversational AI, offering a robust foundation for building and evaluating dialogue systems.
How It Works
The project leverages the HuggingFace Transformers library, adapting the TransferTransfo architecture. Models are pre-trained in two stages: first on a Chinese novel corpus, then on the curated LCCC dataset. Dialogue history is concatenated into a single input sequence for predicting the next response, incorporating word, speaker, and positional embeddings.
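The input layout described above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: the special token names ([CLS], [speaker1], [speaker2]) follow common TransferTransfo convention and are assumptions.

```python
def build_input(history, reply):
    """Concatenate dialogue turns into one token sequence with parallel
    per-token speaker tags and position indices, mirroring the word,
    speaker, and positional embeddings described above. Each turn is a
    list of token strings."""
    words = ["[CLS]"]
    speakers = ["[speaker1]"]
    for i, turn in enumerate(history + [reply]):
        tag = "[speaker1]" if i % 2 == 0 else "[speaker2]"
        words += [tag] + turn          # mark the speaker, then the turn's tokens
        speakers += [tag] * (len(turn) + 1)
    positions = list(range(len(words)))
    return words, speakers, positions

words, speakers, positions = build_input(
    history=[["你", "好"], ["你", "好", "，", "很", "高", "兴"]],
    reply=["我", "也", "是"],
)
```

In the real model each of the three sequences is mapped through its own embedding table and the embeddings are summed per position.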
Quick Start & Requirements
Install dependencies with:

pip install -r requirements.txt

Pre-trained models can then be loaded through the HuggingFace Transformers library by model name (e.g. thu-coai/CDial-GPT_LCCC-large).
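Once a model is loaded, response generation is a standard autoregressive loop: the dialogue history is encoded, and tokens are appended one at a time until an end marker. The sketch below shows the loop shape only; the scoring function is a toy stand-in for the model's forward pass, and the real scripts additionally support sampling strategies.

```python
def greedy_decode(score_next, context, eos="[SEP]", max_len=20):
    """Repeatedly append the highest-scoring next token until EOS.
    `score_next` maps a token sequence to a {token: score} dict and
    stands in for a forward pass of the language model."""
    out = list(context)
    for _ in range(max_len):
        scores = score_next(out)
        tok = max(scores, key=scores.get)
        if tok == eos:
            break
        out.append(tok)
    return out[len(context):]  # return only the generated reply

# Toy scorer: deterministically continues a canned reply (illustration only).
canned = ["我", "很", "好", "[SEP]"]

def toy_scorer(seq):
    return {canned[min(len(seq) - 2, len(canned) - 1)]: 1.0}

reply = greedy_decode(toy_scorer, ["你", "好"])
```

With a real model, `score_next` would run the concatenated history through the network and return next-token logits.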
Maintenance & Community
The project is maintained by Tsinghua University's CoAI (Conversational AI) group. Recent updates include dataset loading via the Hugging Face datasets library and community contributions for visualization and TensorFlow model loading.
Licensing & Compatibility
The dataset and pre-trained models are provided for research purposes only. The README does not specify a license for the code itself, but the disclaimer indicates restrictions on commercial use and liability for generated content.
Limitations & Caveats
The disclaimer explicitly states that despite rigorous cleaning, the LCCC dataset may still contain inappropriate content. The models and scripts are intended solely for research, and the authors disclaim responsibility for any generated dialogue.