chatgpt-corpus by PlexPt

Chinese corpus for LLM training

Created 3 years ago

964 stars

Top 37.6% on SourcePulse

Project Summary

This repository provides a collection of Chinese language corpora generated by ChatGPT, aimed at researchers and developers training large language models. It offers diverse datasets including conversational Q&A, customer service dialogues, and novel outlines, facilitating the development of more capable Chinese NLP applications.

How It Works

The corpus is generated primarily with ChatGPT 3.5, which makes it possible to produce large volumes of both structured and unstructured text. The same approach yields specialized datasets, such as customer service interactions and narrative content, for specific LLM training objectives.

Quick Start & Requirements

Datasets are available for download via GitHub Releases:
- 670k Chinese questions: https://github.com/PlexPt/chatgpt-corpus/releases/tag/3
- 3M GPT3.5 self-Q&A data: https://github.com/PlexPt/chatgpt-corpus/releases/tag/3
- 2M customer service Q&A: https://github.com/PlexPt/chatgpt-corpus/tree/main/kefu
- Novel outlines and novels: https://github.com/PlexPt/chatgpt-corpus/releases/tag/4
Requirements: access to the downloads listed above and standard data processing tools.
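
As a minimal sketch of how the release downloads above could be enumerated, the snippet below queries the public GitHub REST API endpoint for a release by tag and collects the download URL of each attached file. The tags `3` and `4` come from the links above; the helper names (`release_api_url`, `asset_downloads`, `list_release_assets`) are illustrative, and the file names and formats attached to each release are not described here, so nothing is assumed about them.

```python
import json
import urllib.request

OWNER = "PlexPt"
REPO = "chatgpt-corpus"


def release_api_url(tag: str, owner: str = OWNER, repo: str = REPO) -> str:
    """Build the GitHub REST API URL for a release identified by its tag."""
    return f"https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}"


def asset_downloads(release: dict) -> list:
    """Extract (file name, download URL) pairs from a release JSON object."""
    return [
        (asset["name"], asset["browser_download_url"])
        for asset in release.get("assets", [])
    ]


def list_release_assets(tag: str) -> list:
    """Fetch release metadata over the network and list its downloadable files."""
    request = urllib.request.Request(
        release_api_url(tag),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return asset_downloads(json.load(response))
```

Calling `list_release_assets("3")` or `list_release_assets("4")` would return the files attached to those releases. The customer service data is linked as a directory in the repository tree (`kefu`) rather than as a release, so it would need to be fetched by cloning the repository or browsing that directory instead.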

Highlighted Details

- Contains approximately 670,000 Chinese questions generated by ChatGPT 3.5.
- Includes 3 million GPT3.5 self-Q&A data points.
- Offers around 2 million customer service Q&A pairs generated by ChatGPT 3.5.
- Provides datasets of novels and novel outlines generated by ChatGPT 3.5.

Maintenance & Community

A QQ group (558195310) is available for discussion and collaboration.
Contact via WeChat is provided for project cooperation, with a request to specify the purpose.
The project describes updates as ongoing while more data is cleaned.

Licensing & Compatibility

The repository does not explicitly state a license.
Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The corpus is generated by ChatGPT 3.5, so it may carry biases or factual inaccuracies inherent to the model. The repository states no explicit license, which leaves usage rights unclear, particularly for commercial applications.
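
Because the text is model-generated, a light structural filter before training is a common precaution. The sketch below is one hypothetical way to do that, assuming the data has already been loaded as `(question, answer)` string pairs; the corpus's actual file layout is not described here, and the function name `clean_pairs` and the `min_answer_chars` threshold are illustrative choices. It only removes empty, very short, and duplicate entries and does nothing to verify factual accuracy.

```python
def clean_pairs(pairs, min_answer_chars=5):
    """Drop empty, too-short, and duplicate Q&A pairs.

    `pairs` is assumed to be an iterable of (question, answer) strings;
    this input shape is an assumption, not the corpus's documented format.
    """
    seen_questions = set()
    cleaned = []
    for question, answer in pairs:
        # Collapse runs of whitespace and trim both ends.
        q = " ".join(question.split())
        a = " ".join(answer.split())
        # Skip pairs with no question or an answer below the length threshold.
        if not q or len(a) < min_answer_chars:
            continue
        # Keep only the first occurrence of each normalized question.
        if q in seen_questions:
            continue
        seen_questions.add(q)
        cleaned.append((q, a))
    return cleaned
```

Deduplicating on the normalized question keeps the first answer seen for each question; a stricter pipeline might deduplicate on the full pair or use fuzzy matching instead.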

Health Check

Last Commit: 2 years ago

Responsiveness: Inactive

Star History

6 stars in the last 30 days