Awesome-LLMs-Datasets by lmmlzn

LLM datasets survey for pre-training, fine-tuning, preference, evaluation, and NLP

Created 2 years ago

1,421 stars

Top 28.4% on SourcePulse

Project Summary

This repository serves as a comprehensive survey and catalog of datasets used for Large Language Models (LLMs), categorized across five primary dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, and Traditional NLP Datasets. It also includes emerging categories for Multi-modal LLMs and Retrieval Augmented Generation (RAG) datasets, aiming to provide a structured overview for researchers and practitioners in the LLM field.

How It Works

The project meticulously collects and organizes information on LLM datasets, detailing aspects such as release time, public availability, language, construction method, associated papers, GitHub repositories, dataset links, and publishers. Datasets are further classified by domain (e.g., finance, medical, code) and construction methodology (human-generated, model-constructed, or collection/improvement of existing data). The survey also includes detailed metadata for each dataset, such as size, license, and specific task categories.
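The metadata described above maps naturally onto a simple record type. The following is a minimal sketch, not the repository's own format: the class name `DatasetEntry`, the enum `Construction`, the helper `select`, and the placeholder entries are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Construction(Enum):
    # The three construction methodologies named in the summary.
    HUMAN_GENERATED = "human-generated"
    MODEL_CONSTRUCTED = "model-constructed"
    COLLECTION_IMPROVEMENT = "collection/improvement of existing data"


@dataclass
class DatasetEntry:
    # Hypothetical record; field names are assumptions, not the repo's schema.
    name: str
    release_time: str
    public: bool
    language: str
    construction: Construction
    domain: str
    license: Optional[str] = None
    size: Optional[str] = None
    paper: Optional[str] = None
    github: Optional[str] = None
    link: Optional[str] = None
    publisher: Optional[str] = None
    task_categories: List[str] = field(default_factory=list)


def select(entries, domain=None, construction=None):
    """Return entries matching the given domain and/or construction method."""
    return [
        e for e in entries
        if (domain is None or e.domain == domain)
        and (construction is None or e.construction is construction)
    ]


# Placeholder entries for illustration only; these are not real datasets.
entries = [
    DatasetEntry("example-finance-sft", "2023-01", True, "EN",
                 Construction.MODEL_CONSTRUCTED, "finance"),
    DatasetEntry("example-medical-qa", "2023-06", True, "ZH",
                 Construction.HUMAN_GENERATED, "medical", license="Apache-2.0"),
    DatasetEntry("example-code-corpus", "2022-11", False, "Multi",
                 Construction.COLLECTION_IMPROVEMENT, "code"),
]

medical = select(entries, domain="medical")
```

Filtering on both axes at once, for example `select(entries, domain="code", construction=Construction.COLLECTION_IMPROVEMENT)`, mirrors how the catalog cross-classifies datasets by domain and construction method.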

Quick Start & Requirements

This repository is a curated list and does not require installation or execution. It serves as a reference guide.

Highlighted Details

Extensive Coverage: Encompasses over 444 datasets, spanning 8 language categories and 32 domains.
Detailed Metadata: Information is provided across 20 dimensions. The total data surveyed exceeds 774.5 TB for pre-training corpora and 700M instances for the other dataset categories.
Regular Updates: New sections for Multi-modal LLMs and RAG datasets have been added, with plans for continued updates.
Comprehensive Survey Paper: A linked survey paper provides a deeper analysis of LLM datasets, their challenges, and future trends.

Maintenance & Community

The project is actively maintained, with recent updates noted in the changelog. Contributions are welcomed via pull requests for new datasets or corrections. Contact information for maintainers is provided.

Licensing & Compatibility

Dataset licenses vary widely, including Apache-2.0, MIT, CC-BY-SA, and proprietary licenses. Users must consult individual dataset licenses for compatibility with commercial or closed-source applications.
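Because licenses differ per dataset, a first-pass triage can separate entries with well-known permissive licenses from those that need a manual read. This is a rough sketch under stated assumptions: the `PERMISSIVE` set, the function names, and the placeholder catalog are hypothetical, and the license strings are assumed to be exact identifiers such as `Apache-2.0`. Anything not on the short permissive list, including share-alike and proprietary terms, is routed to review rather than judged automatically.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: only these two permissive licenses are treated as pre-cleared.
PERMISSIVE = frozenset({"Apache-2.0", "MIT"})


def needs_license_review(license_id: Optional[str]) -> bool:
    """True when the license is missing or not on the permissive list."""
    if license_id is None:
        return True
    return license_id.strip() not in PERMISSIVE


def split_by_license(
    catalog: Dict[str, Optional[str]]
) -> Tuple[List[str], List[str]]:
    """Split dataset names into (pre-cleared, needs manual review)."""
    cleared, review = [], []
    for name, license_id in catalog.items():
        (review if needs_license_review(license_id) else cleared).append(name)
    return cleared, review


# Placeholder catalog for illustration only; these are not real datasets.
catalog = {
    "example-a": "Apache-2.0",
    "example-b": "CC-BY-SA",
    "example-c": None,
    "example-d": "MIT",
}

cleared, review = split_by_license(catalog)
```

The conservative default (unknown means review) reflects the point above that each dataset's license has to be consulted individually.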

Limitations & Caveats

Given the sheer volume of LLM datasets, this catalog may not be exhaustive. The project relies on community contributions and publicly available information, so a listed dataset's availability or license may change after it is recorded. Future updates will record only key details for each dataset to keep maintenance efficient.

Explore Similar Projects

instruction-datasets by raunak-agarwal

Open-Qwen2VL by Victorwz

ContinualLM by UIC-Liu-Lab

Awesome-LLM by MLNLP-World

LLaMA-Cult-and-More by shm007g

awesome-instruction-datasets by jianzhnie

sft_datasets by chaoswork

awesome-instruction-dataset by yaodongC

finetune by IndicoDataSolutions

indicnlp_catalog by AI4Bharat

LLMDataHub by Zjh-819

llm-datasets by mlabonne