LLMDataHub by Zjh-819

Dataset list for LLM training

Created 3 years ago

3,399 stars

Top 13.8% on SourcePulse

Project Summary

This repository serves as a curated hub for Large Language Model (LLM) training datasets, focusing on instruction finetuning and pretraining corpora. It aims to consolidate scattered open-source datasets, making them accessible to researchers and practitioners looking to train or improve LLMs, particularly chatbots.

How It Works

The project organizes datasets by category (Alignment, Domain-specific, Pretraining, Multimodal) and by release date. Each entry carries key metadata: dataset name, usage, type (SFT, Dialog, Pairs, PT, RLHF, CoT), language, size, and a brief description. This structure makes it easier to find and select relevant training data.
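The metadata fields described above map naturally onto a small record type. The following is a minimal sketch, not code from the repository: the `DatasetEntry` class, the `select` helper, and every entry name and value are hypothetical placeholders. It assumes the list has already been transcribed into Python objects by hand or by a parser of your own.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the list (hypothetical schema, mirroring the fields named above)."""
    name: str
    category: str            # "Alignment", "Domain-specific", "Pretraining", "Multimodal"
    types: tuple[str, ...]   # any of "SFT", "Dialog", "Pairs", "PT", "RLHF", "CoT"
    language: str
    size: str                # kept as free text, since the list reports sizes in mixed units
    description: str


def select(entries, *, type_tag=None, language=None):
    """Return entries matching the given type tag and/or language (case-insensitive)."""
    result = []
    for entry in entries:
        if type_tag is not None and type_tag not in entry.types:
            continue
        if language is not None and language.lower() not in entry.language.lower():
            continue
        result.append(entry)
    return result


# Placeholder entries for illustration only; these are not real datasets.
entries = [
    DatasetEntry("example-sft-en", "Alignment", ("SFT",), "English", "n/a", "placeholder"),
    DatasetEntry("example-dialog-zh", "Alignment", ("Dialog", "SFT"), "Chinese", "n/a", "placeholder"),
    DatasetEntry("example-pt-multi", "Pretraining", ("PT",), "Multilingual", "n/a", "placeholder"),
]

sft_names = [e.name for e in select(entries, type_tag="SFT")]
chinese_sft_names = [e.name for e in select(entries, type_tag="SFT", language="chinese")]
```

Storing `types` as a tuple lets a single entry carry several tags, which fits a list where one dataset may serve more than one training objective.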

Quick Start & Requirements

This repository is a curated list of links and metadata; it does not require installation or execution. Users are directed to the original sources for dataset downloads and usage.

Highlighted Details

Comprehensive categorization of datasets by training objective (e.g., Supervised Finetuning, Reinforcement Learning from Human Feedback, Pretraining).
Inclusion of datasets specifically for improving LLM capabilities in areas like STEM reasoning, coding, and long-context understanding.
Coverage of both English and Chinese language datasets, as well as multilingual options.
Metadata includes dataset size, language, and specific use cases or models they were used with.

Maintenance & Community

The project is maintained by Zjh-819 and advised by Prof. Wanyun Cui. Contributions are welcome; contributors should contact the maintainer.

Licensing & Compatibility

Licensing varies by dataset and is set by the original source. Users must consult the terms of each dataset individually; whether commercial use is permitted depends on that dataset's license.

Limitations & Caveats

The repository is a curated list and does not host the datasets themselves. Users must navigate to external links for access, and some datasets may have specific usage restrictions or require significant processing. Some entries have notes like "⚠️use with care" or "⚠️RISKY," indicating potential issues with data quality or ethical considerations.
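When screening entries programmatically, the warning notes quoted above can be detected with a simple text check. This is a sketch under assumptions: the `warning_flags` function is hypothetical, and it assumes a note is plain text containing the warning sign (U+26A0) followed somewhere by one of the two quoted labels. The list's actual formatting may differ.

```python
# Labels quoted in the caveat above, lowercased for case-insensitive matching.
WARNING_LABELS = ("use with care", "risky")
WARNING_SIGN = "\u26a0"  # the warning sign character, with or without an emoji variation selector


def warning_flags(note: str) -> list[str]:
    """Return the warning labels found in a note, or an empty list if it is unflagged."""
    if WARNING_SIGN not in note:
        return []  # require the sign, so ordinary prose containing "risky" is not flagged
    lowered = note.casefold()
    return [label for label in WARNING_LABELS if label in lowered]
```

A flagged entry still needs a manual look at the original source, since the notes do not say whether the concern is data quality, ethics, or licensing.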

Explore Similar Projects

instruction-datasets by raunak-agarwal

Awesome-LLM by MLNLP-World

InstructionZoo by FreedomIntelligence

LLaMA-Cult-and-More by shm007g

DialogStudio by salesforce

LLM-Synthetic-Data by pengr

sft_datasets by chaoswork

awesome-instruction-datasets by jianzhnie

mistral by stanford-crfm

awesome-instruction-dataset by yaodongC

Awesome-LLMs-Datasets by lmmlzn

llm-datasets by mlabonne