chatgpt-corpus  by PlexPt

Chinese corpus for LLM training

Created 2 years ago
913 stars

Top 39.9% on SourcePulse

Project Summary

This repository provides a collection of Chinese language corpora generated by ChatGPT, aimed at researchers and developers training large language models. It offers diverse datasets including conversational Q&A, customer service dialogues, and novel outlines, facilitating the development of more capable Chinese NLP applications.

How It Works

The corpus is generated primarily with ChatGPT 3.5, which is used to produce large volumes of both structured and unstructured text. This makes it possible to build specialized datasets, such as customer service interactions and narrative content, for specific LLM training objectives.

Quick Start & Requirements

Highlighted Details

  • Contains approximately 670,000 Chinese questions generated by ChatGPT 3.5.
  • Includes 3 million self-Q&A data points, in which ChatGPT 3.5 both asks and answers the question.
  • Offers around 2 million customer service Q&A pairs generated by ChatGPT 3.5.
  • Provides datasets of novels and novel outlines generated by ChatGPT 3.5.
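
As an illustration only, model-generated Q&A pairs like those listed above are often screened before training. This summary does not describe the repository's file layout, so the JSON Lines format and the `question` / `answer` field names below are assumptions, and `iter_clean_pairs` is a hypothetical helper rather than anything shipped with the corpus. The sketch drops malformed rows, empty fields, and repeated questions:

```python
import json
from typing import Iterable, Iterator

# Hypothetical field names; the repository's real schema may differ.
QUESTION_KEY = "question"
ANSWER_KEY = "answer"


def iter_clean_pairs(lines: Iterable[str], min_answer_chars: int = 2) -> Iterator[dict]:
    """Yield de-duplicated Q&A records from JSON Lines text (assumed format)."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip rows that are not valid JSON
        if not isinstance(record, dict):
            continue
        question = str(record.get(QUESTION_KEY) or "").strip()
        answer = str(record.get(ANSWER_KEY) or "").strip()
        if not question or len(answer) < min_answer_chars:
            continue  # skip empty questions and empty or very short answers
        if question in seen:
            continue  # keep only the first answer for each question
        seen.add(question)
        yield {QUESTION_KEY: question, ANSWER_KEY: answer}


# Made-up sample rows for demonstration; not taken from the corpus.
sample = [
    '{"question": "什么是机器学习?", "answer": "机器学习是人工智能的一个分支。"}',
    '{"question": "什么是机器学习?", "answer": "重复的问题。"}',
    '{"question": "", "answer": "没有问题的回答。"}',
    '{"question": "你好", "answer": ""}',
    'not json',
]
cleaned = list(iter_clean_pairs(sample))
print(len(cleaned))
```

In practice the same generator could be fed an open file handle instead of a list, so a multi-million-row file is processed one line at a time; only the set of seen questions is held in memory.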

Maintenance & Community

  • A QQ group (558195310) is available for discussion and collaboration.
  • A WeChat contact is provided for project cooperation; the maintainer asks that you state your purpose when reaching out.
  • Updates are ongoing as more data is cleaned.

Licensing & Compatibility

  • The repository does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The corpus is entirely generated by ChatGPT 3.5, which may introduce biases or factual inaccuracies inherent to the model. The lack of an explicit license raises concerns regarding usage rights, particularly for commercial applications.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
