CDial-GPT by thu-coai

Chinese GPT model for short-text conversation, plus dataset

created 5 years ago
1,891 stars

Top 23.5% on sourcepulse

Project Summary

This project provides LCCC, a large-scale cleaned Chinese conversation dataset, together with GPT models pre-trained for Chinese dialogue generation. It is aimed at researchers and developers working on Chinese NLP and conversational AI who need a pre-trained starting point for building and evaluating dialogue systems.

How It Works

The code builds on the HuggingFace Transformers library and adapts the TransferTransfo architecture. Models are pre-trained in two stages: first on a Chinese novel corpus, then on the curated LCCC dataset. The dialogue history is concatenated into a single input sequence, from which the model predicts the next response. Each input token is represented by the sum of its word, speaker, and positional embeddings.
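The input layout described above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: the function name `build_input`, the special-token ids, and the choice to mark each turn with a leading speaker token are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical special-token ids, chosen only for this illustration.
CLS, SEP, SPEAKER1, SPEAKER2 = 0, 1, 2, 3


def build_input(history: List[List[int]], reply: List[int]) -> Dict[str, List[int]]:
    """Flatten dialogue turns into one sequence with parallel speaker and position ids."""
    # The reply is the last turn and is closed with an end-of-sequence token.
    turns = history + [reply + [SEP]]
    input_ids = [CLS]
    speaker_ids = [CLS]
    for i, turn in enumerate(turns):
        # Speakers alternate turn by turn, starting with speaker 1.
        speaker = SPEAKER1 if i % 2 == 0 else SPEAKER2
        segment = [speaker] + turn
        input_ids.extend(segment)
        # Every token in a turn carries that turn's speaker id.
        speaker_ids.extend([speaker] * len(segment))
    # Positions simply count tokens from the start of the sequence.
    position_ids = list(range(len(input_ids)))
    return {
        "input_ids": input_ids,
        "speaker_ids": speaker_ids,
        "position_ids": position_ids,
    }


example = build_input(history=[[10, 11], [12]], reply=[13, 14])
```

In a model, each of the three id lists would index its own embedding table, and the three looked-up vectors would be added per token before entering the Transformer layers.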

Quick Start & Requirements

  • Install dependencies with pip: pip install -r requirements.txt
  • Requires PyTorch.
  • Pre-trained models can be loaded from Hugging Face Hub (thu-coai/CDial-GPT_LCCC-large).
  • Fine-tuning requires a dialogue dataset such as STC.
  • Official quick-start and fine-tuning scripts are provided.
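Loading the checkpoint named above might look like the following sketch. It assumes the checkpoint is compatible with the `BertTokenizer` and `OpenAIGPTLMHeadModel` classes from Transformers; that pairing and the helper name `load_cdial_gpt` are assumptions for illustration, so check the official quick-start script before relying on them.

```python
MODEL_NAME = "thu-coai/CDial-GPT_LCCC-large"


def load_cdial_gpt(model_name: str = MODEL_NAME):
    """Return (tokenizer, model); the first call downloads weights from the Hub."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import BertTokenizer, OpenAIGPTLMHeadModel

    # Assumption: the checkpoint ships a BERT-style vocabulary and GPT-style weights.
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = OpenAIGPTLMHeadModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Calling `tokenizer, model = load_cdial_gpt()` needs network access and both PyTorch and Transformers installed.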

Highlighted Details

  • Offers two versions of the LCCC dataset: LCCC-base (more strictly filtered) and LCCC-large (larger, incorporating more sources).
  • Provides multiple pre-trained models (GPT Novel and several CDial-GPT variants), each with about 95.5M parameters.
  • Includes comprehensive evaluation results (automatic and human) on the STC dataset.
  • Supports distributed training.

Maintenance & Community

The project is maintained by thu-coai, the Conversational AI (CoAI) group at Tsinghua University. Recent updates include dataset loading via the Hugging Face datasets library and community contributions for visualization and for loading the models in TensorFlow.

Licensing & Compatibility

The dataset and pre-trained models are provided for research purposes only. The README does not specify a license for the code itself, but the disclaimer indicates restrictions on commercial use and liability for generated content.

Limitations & Caveats

The disclaimer explicitly states that despite rigorous cleaning, the LCCC dataset may still contain inappropriate content. The models and scripts are intended solely for research, and the authors disclaim responsibility for any generated dialogue.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 38 stars in the last 90 days
