This repository curates a comprehensive collection of open-source datasets for Supervised Fine-Tuning (SFT) of Large Language Models (LLMs). It targets researchers and developers working on LLM alignment and instruction tuning, providing a centralized resource for diverse tasks and languages.
How It Works
The project acts as a catalog, listing numerous SFT datasets with details on size, language, task type (e.g., Machine Translation, Text Summarization, Instruction Following), generation method (e.g., Self-Instruct, Collection, Distillation), and data source, making it easier to discover and access a wide array of LLM training data.
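As a rough illustration, each catalog entry can be modeled as a small record. The schema below is a hypothetical sketch based on the fields the catalog lists; the `DatasetEntry` type and field names are illustrative, not the repo's actual format.

```python
from dataclasses import dataclass

# Hypothetical schema mirroring the catalog's per-dataset fields;
# the names here are illustrative, not the repo's actual layout.
@dataclass
class DatasetEntry:
    name: str        # e.g. "Alpaca"
    size: str        # e.g. "52K instruction-response pairs"
    language: str    # e.g. "EN", "CN", or mixed
    task_type: str   # e.g. "Instruction Following"
    generation: str  # e.g. "Self-Instruct", "Collection", "Distillation"
    source_url: str  # download link for the underlying data

alpaca = DatasetEntry(
    name="Alpaca",
    size="52K",
    language="EN",
    task_type="Instruction Following",
    generation="Self-Instruct",
    source_url="https://github.com/tatsu-lab/stanford_alpaca",
)
```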
Quick Start & Requirements
- Datasets are accessed via provided download links.
- No software installation is required to browse the catalog or download the dataset metadata; using a dataset for training depends on your chosen LLM training framework (a minimal loading sketch follows this list).
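As a minimal sketch, assuming a downloaded dataset in JSON Lines format with Alpaca-style `instruction`/`output` fields (a common convention, but each dataset's actual schema must be checked individually):

```python
import json

def load_sft_jsonl(path: str) -> list[dict]:
    """Load one training example per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Hypothetical file name; substitute the file you actually downloaded.
examples = load_sft_jsonl("dataset.jsonl")
print(examples[0]["instruction"], "->", examples[0]["output"])
```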
Highlighted Details
- Extensive coverage of Chinese (CN) datasets, including Belle, Firefly, GAOKAO, and COIG.
- Includes English (EN) and mixed-language datasets like Alpaca, Dolly 2.0, and ShareChat.
- Covers a broad spectrum of NLP tasks, from general instruction following and dialogue to specialized areas like code generation and financial Q&A.
- Details data generation methods, including human annotation, distillation from powerful models (GPT-3, GPT-4), and self-instruct techniques (a minimal self-instruct sketch follows this list).
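To make the self-instruct idea concrete, here is a hedged sketch of a single bootstrapping round. The `llm` callable is a hypothetical stand-in for whatever model API you use; real pipelines (e.g., the original Self-Instruct work) add similarity-based de-duplication and quality filtering on top of this.

```python
import random

def self_instruct_round(seed_tasks: list[str], llm, num_new: int = 5) -> list[str]:
    """One bootstrapping round: show the model a few seed instructions
    and ask it to propose new ones in the same style."""
    demos = random.sample(seed_tasks, k=min(3, len(seed_tasks)))
    prompt = (
        "Here are example task instructions:\n"
        + "\n".join(f"- {t}" for t in demos)
        + "\nWrite one new, different task instruction:"
    )
    # `llm` is assumed to map a prompt string to a completion string.
    candidates = [llm(prompt).strip() for _ in range(num_new)]
    # Crude exact-match filtering; real pipelines use ROUGE-style thresholds.
    return [c for c in candidates if c not in seed_tasks]
```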
Maintenance & Community
- The project is community-driven, with a call for contributions to expand the dataset list.
- Download links are provided for each dataset entry.
Licensing & Compatibility
- Dataset licenses vary by source; users must verify individual dataset licenses for compatibility with their intended use, especially for commercial applications.
Limitations & Caveats
- The repository itself is a catalog; it does not host the datasets directly. Users must follow individual links to download.
- Dataset quality and suitability for specific LLM training tasks depend on the original source and generation methodology.