Dataset list for LLM training
This repository serves as a curated hub for Large Language Model (LLM) training datasets, focusing on instruction finetuning and pretraining corpora. It aims to consolidate scattered open-source datasets, making them accessible to researchers and practitioners looking to train or improve LLMs, particularly chatbots.
How It Works
The project organizes datasets by type (Alignment, Domain-specific, Pretraining, Multimodal) and release date, providing key metadata such as dataset name, usage, type (SFT, Dialog, Pairs, PT, RLHF, CoT), language, size, and a brief description. This structured approach facilitates efficient discovery and selection of relevant training data.
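The metadata fields above can be pictured as one record per dataset. The following is a minimal sketch, not the repository's actual format: the `DatasetEntry` class, the `select` helper, and the three catalog entries are all hypothetical names and placeholder values introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the list (hypothetical schema mirroring the fields described above)."""
    name: str
    category: str   # Alignment, Domain-specific, Pretraining, or Multimodal
    released: date
    usage: str
    data_type: str  # SFT, Dialog, Pairs, PT, RLHF, or CoT
    language: str
    size: str
    description: str


def select(entries, *, data_type=None, language=None):
    """Return entries matching the given filters, newest release first."""
    matches = [
        e for e in entries
        if (data_type is None or e.data_type == data_type)
        and (language is None or e.language == language)
    ]
    return sorted(matches, key=lambda e: e.released, reverse=True)


# Placeholder entries invented for this sketch; they are not real datasets.
catalog = [
    DatasetEntry("example-sft-a", "Alignment", date(2023, 3, 1),
                 "instruction tuning", "SFT", "EN", "52K", "placeholder"),
    DatasetEntry("example-sft-b", "Alignment", date(2023, 5, 1),
                 "instruction tuning", "SFT", "EN", "10K", "placeholder"),
    DatasetEntry("example-pt-c", "Pretraining", date(2023, 4, 1),
                 "pretraining", "PT", "Multilingual", "1T tokens", "placeholder"),
]

newest_sft = select(catalog, data_type="SFT")
print([e.name for e in newest_sft])
```

Sorting by release date after filtering follows the list's own organization by type and release date; other orderings, such as by size, would need the size field parsed into a number first.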
Quick Start & Requirements
This repository is a curated list of links and metadata; it does not require installation or execution. Users are directed to the original sources for dataset downloads and usage.
Maintenance & Community
The project is maintained by Zjh-819 and advised by Prof. Wanyun Cui. To contribute, contact the maintainer.
Licensing & Compatibility
Licensing is set by each dataset's original source and varies from one dataset to the next. Check the license of every dataset before use; commercial use is possible only where that dataset's license permits it.
Limitations & Caveats
The repository does not host the datasets themselves; each one must be obtained from its external source. Some datasets carry usage restrictions or need significant processing before they can be used for training. Entries marked "⚠️use with care" or "⚠️RISKY" indicate potential data-quality or ethical concerns.