Curated datasets/tools for LLM post-training
This repository provides a curated list of datasets and tools for post-training Large Language Models (LLMs). It aims to help developers and researchers find high-quality data for supervised fine-tuning (SFT), preference alignment, and other LLM training tasks, covering general-purpose, math, code, instruction following, multilingual, agent, function calling, real conversations, and preference datasets.
How It Works
The list groups datasets by their intended use in post-training, such as SFT for instruction following or preference alignment for steering model outputs toward human preferences. Each entry gives the dataset's size, authors, and date, with notes on its content and licensing, and often links to the original paper or source. A separate section covers tools for data scraping, filtering, generation, and exploration.
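The catalog structure described above can be pictured as a small table of records. The following is a minimal sketch, not the repository's own format: the `DatasetEntry` class, its field names, the helper functions, and every row in `CATALOG` are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of a hypothetical dataset catalog (field names are assumptions)."""
    name: str
    category: str      # e.g. "math", "code", "preference"
    num_samples: int
    author: str
    date: str          # free-form, e.g. "2024-06"
    note: str = ""     # content or licensing remarks


# Placeholder rows for illustration only; these are not real datasets.
CATALOG = [
    DatasetEntry("example-math-small", "math", 10_000, "example-org", "2024-01"),
    DatasetEntry("example-code", "code", 50_000, "example-org", "2024-03"),
    DatasetEntry("example-math-large", "math", 200_000, "example-lab", "2024-06"),
]


def by_category(catalog, category):
    """Return the entries whose category matches exactly."""
    return [entry for entry in catalog if entry.category == category]


def largest_first(entries):
    """Sort entries by sample count, largest first."""
    return sorted(entries, key=lambda entry: entry.num_samples, reverse=True)


if __name__ == "__main__":
    for entry in largest_first(by_category(CATALOG, "math")):
        print(f"{entry.name}: {entry.num_samples:,} samples ({entry.author}, {entry.date})")
```

Keeping the category as a plain field mirrors how a reader would scan the list: pick a task first, then compare candidates by size and date.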
Quick Start & Requirements
This repository is a curated list and does not require installation or execution. It serves as a reference guide.
Maintenance & Community
The repository is maintained by mlabonne and acknowledges contributions from several individuals. It links to the author's X (Twitter) account and blog and references numerous academic papers, reflecting its ties to the research community.
Licensing & Compatibility
Datasets are noted to be under permissive licenses such as Apache 2.0, MIT, and CC-BY-4.0. Some datasets carry more restrictive licenses (e.g., CC-BY-NC-4.0 for tulu3-sft-mixture), so check each dataset's license before using it commercially or in a closed-source product.
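A license check of this kind can be automated as a first pass. The sketch below is an assumption-laden illustration, not part of the repository: the allowlist contains only the three licenses named above, the function names are hypothetical, and `example-permissive-dataset` is a placeholder. Anything outside the allowlist is flagged for manual review rather than judged.

```python
# Licenses the text above names as permissive; treating exactly this set
# as the allowlist is an assumption made for the sketch.
PERMISSIVE = {"apache-2.0", "mit", "cc-by-4.0"}


def normalize(license_id):
    """Lowercase and hyphenate so "Apache 2.0" matches "apache-2.0"."""
    return license_id.strip().lower().replace(" ", "-")


def needs_review(license_id):
    """True when the license is not on the permissive allowlist."""
    return normalize(license_id) not in PERMISSIVE


def flag_for_review(dataset_licenses):
    """Return the names of datasets whose license needs a manual check."""
    return [name for name, lic in dataset_licenses.items() if needs_review(lic)]


# tulu3-sft-mixture and its license come from the text above;
# the second entry is a hypothetical placeholder.
DATASET_LICENSES = {
    "tulu3-sft-mixture": "CC-BY-NC-4.0",
    "example-permissive-dataset": "Apache 2.0",
}

if __name__ == "__main__":
    print(flag_for_review(DATASET_LICENSES))
```

An allowlist is deliberately conservative: an unrecognized or misspelled license string is flagged instead of silently passing.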
Limitations & Caveats
While the list is extensive, the quality and suitability of each dataset for specific LLM training tasks will vary and require individual evaluation. Some datasets are synthetic or derived from other sources, necessitating careful review for accuracy and potential biases.