llm-datasets by mlabonne

Curated datasets/tools for LLM post-training

Created 2 years ago

4,688 stars

Top 10.4% on SourcePulse

View on GitHub

7 Experts Love This Project

Jesse Clark

Cofounder of Marqo

Luca Soldaini

Research Scientist at Ai2

Jeremy Howard

Cofounder of fast.ai

Shyamal Anadkat

Research Scientist at OpenAI

and 3 more!

Project Summary

This repository provides a curated list of datasets and tools for post-training Large Language Models (LLMs). It aims to help developers and researchers find high-quality data for supervised fine-tuning (SFT), preference alignment, and other LLM training tasks, covering general-purpose, math, code, instruction following, multilingual, agent, function calling, real conversations, and preference datasets.

How It Works

The project categorizes datasets based on their intended use in LLM post-training, such as Supervised Fine-Tuning (SFT) for instruction following or preference alignment for aligning model outputs with human preferences. It lists numerous datasets with their sizes, authors, dates, and notes on their content and licensing, often linking to the original papers or sources. The repository also includes a section on tools for data scraping, filtering, generation, and exploration, providing a comprehensive ecosystem for LLM data management.

Quick Start & Requirements

This repository is a curated list and does not require installation or execution. It serves as a reference guide.

Highlighted Details

Comprehensive coverage of LLM post-training data types, including specialized areas like math reasoning, code generation, and agent function calling.
Inclusion of tools for the entire data lifecycle: scraping, filtering, generation, and exploration.
Datasets are generally under permissive licenses (Apache 2.0, MIT, CC-BY-4.0), facilitating broad adoption.
Links to original papers and sources are provided for deeper dives into dataset methodologies.

Maintenance & Community

The repository is maintained by mlabonne and acknowledges contributions from several individuals. It includes links to the author's X (Twitter) and Blog, and references numerous academic papers, indicating a strong connection to the research community.

Licensing & Compatibility

Datasets are noted to be under permissive licenses such as Apache 2.0, MIT, and CC-BY-4.0. Some datasets may have specific licenses (e.g., CC-BY-NC-4.0 for tulu3-sft-mixture), which should be checked for commercial use or closed-source linking.

Limitations & Caveats

While the list is extensive, the quality and suitability of each dataset for specific LLM training tasks will vary and require individual evaluation. Some datasets are synthetic or derived from other sources, necessitating careful review for accuracy and potential biases.

Health Check

Last Commit

2 months ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

57 stars in the last 30 days