awesome-chatgpt-dataset by voidful

Dataset repo for LLM training

Created 3 years ago

765 stars

Top 44.8% on SourcePulse

2 Experts Love This Project

Project Summary

This repository curates a collection of datasets for training and fine-tuning Large Language Models (LLMs), particularly models that aim for ChatGPT-like conversational abilities. It gives researchers and developers who are building or improving LLM-powered applications a structured overview of the available instruction-following and conversational datasets.

How It Works

The project acts as a central index that lists and categorizes datasets suitable for LLM training. For each dataset it records metadata such as size, languages, source, and license, which makes quick comparison and selection easier. A preprocess.py script helps users merge and prepare selected datasets for upload to platforms such as the Hugging Face Hub.
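To make the merging step concrete, here is a minimal sketch of combining records from several sources into one schema and dropping exact duplicates. It is not the repository's preprocess.py; the function names (normalize, merge_datasets) and the field names (instruction/input/output, prompt/response) are assumptions chosen for illustration, and the sample records are made up.

```python
from typing import Dict, Iterable, List


def normalize(record: dict, source: str) -> dict:
    """Map a raw record onto a shared instruction/input/output schema.

    Assumption: sources use either instruction/output or prompt/response keys.
    """
    return {
        "instruction": str(record.get("instruction") or record.get("prompt") or "").strip(),
        "input": str(record.get("input") or "").strip(),
        "output": str(record.get("output") or record.get("response") or "").strip(),
        "source": source,
    }


def merge_datasets(datasets: Dict[str, Iterable[dict]]) -> List[dict]:
    """Merge records from all sources, skipping incomplete rows and exact duplicates."""
    merged: List[dict] = []
    seen = set()
    for source, records in datasets.items():
        for record in records:
            row = normalize(record, source)
            key = (row["instruction"], row["input"], row["output"])
            # Skip rows without an instruction or an output, and rows already seen.
            if not row["instruction"] or not row["output"] or key in seen:
                continue
            seen.add(key)
            merged.append(row)
    return merged


if __name__ == "__main__":
    # Made-up sample records, not taken from any listed dataset.
    sample = {
        "source_a": [{"instruction": "Say hi", "output": "hi"}],
        "source_b": [{"prompt": "Say hi", "response": "hi"},
                     {"prompt": "Add 1+1", "response": "2"}],
    }
    for row in merge_datasets(sample):
        print(row["source"], "|", row["instruction"])
```

Keeping a source field on every merged row makes it possible to trace each example back to its dataset, which matters later when licenses differ between sources.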

Quick Start & Requirements

Clone the repository: git clone https://github.com/voidful/awesome-chatgpt-dataset.git
Requires Python for the preprocess.py script.
Datasets vary in size and licensing; users must manage their own download and storage.
Official documentation and dataset details are available within the repository.

Highlighted Details

Extensive catalog of over 50 datasets, ranging from a few thousand to tens of millions of examples.
Covers a wide array of languages, including English, Chinese, Portuguese, and Japanese.
Includes datasets specifically for code generation, mathematical reasoning, and multimodal instruction following.
Features datasets with varying licensing, from permissive MIT and Apache 2.0 to more restrictive CC BY-NC-SA and research-only licenses.

Maintenance & Community

The repository appears to be community-driven, with contributions from various users. Specific maintainer details or active community channels (like Discord/Slack) are not prominently featured in the README.

Licensing & Compatibility

Licenses vary significantly across datasets, including MIT, Apache 2.0, CC BY-NC-SA, CC BY 4.0, GPLv3, and custom licenses. Some datasets have restrictions for commercial use or require adherence to OpenAI's terms of use. Users must carefully review the license for each dataset they intend to use.
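One way to apply this review in practice is to filter the catalog against an explicit allow-list of licenses before downloading anything. The sketch below is only an illustration: the CatalogEntry type, the select function, the entry names, and the allow-list are all hypothetical, and which licenses are acceptable is a decision each user has to make for their own use case.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CatalogEntry:
    """Hypothetical in-memory form of one catalog row."""
    name: str
    license: str
    languages: List[str]
    size: int


def select(entries: Iterable[CatalogEntry],
           allowed_licenses: Iterable[str],
           language: str) -> List[CatalogEntry]:
    """Keep entries whose license is on the allow-list and that cover the language.

    Entries with an empty or unrecognized license are excluded by construction.
    """
    allowed = {name.lower() for name in allowed_licenses}
    return [
        entry for entry in entries
        if entry.license.lower() in allowed and language in entry.languages
    ]


if __name__ == "__main__":
    # Placeholder entries for illustration only, not real catalog rows.
    catalog = [
        CatalogEntry("example-a", "MIT", ["English"], 52_000),
        CatalogEntry("example-b", "CC BY-NC-SA", ["English", "Chinese"], 1_000_000),
        CatalogEntry("example-c", "", ["English"], 10_000),
    ]
    # Assumption: the user has decided that these two licenses fit their project.
    for entry in select(catalog, ["MIT", "Apache 2.0"], "English"):
        print(entry.name, entry.license)
```

An allow-list is used rather than a block-list so that datasets with missing or unclear license information are left out until someone has checked them by hand.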

Limitations & Caveats

The repository itself does not host the datasets; users are responsible for acquiring and managing them. The quality and suitability of individual datasets for specific LLM training tasks are not guaranteed and require user evaluation. Some datasets may have unclear or conflicting licensing information.

Health Check

Last Commit

8 months ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days