LLM datasets survey covering pre-training, instruction fine-tuning, preference, evaluation, and traditional NLP datasets
This repository serves as a comprehensive survey and catalog of datasets used for Large Language Models (LLMs), categorized across five primary dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, and Traditional NLP Datasets. It also includes emerging categories for Multi-modal LLMs and Retrieval Augmented Generation (RAG) datasets, aiming to provide a structured overview for researchers and practitioners in the LLM field.
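For illustration only, the taxonomy described above could be modeled as a simple enumeration. This is a hypothetical sketch: the repository itself is a curated markdown catalog, not a code package, and the class and member names below are not defined by the project.

```python
from enum import Enum

class DatasetCategory(Enum):
    """Illustrative encoding of the five primary dimensions plus the two emerging categories."""
    PRE_TRAINING_CORPORA = "Pre-training Corpora"
    INSTRUCTION_FINE_TUNING = "Instruction Fine-tuning Datasets"
    PREFERENCE = "Preference Datasets"
    EVALUATION = "Evaluation Datasets"
    TRADITIONAL_NLP = "Traditional NLP Datasets"
    MULTI_MODAL = "Multi-modal LLM Datasets"          # emerging category
    RAG = "Retrieval Augmented Generation Datasets"   # emerging category
```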
How It Works
The project meticulously collects and organizes information on LLM datasets, detailing aspects such as release time, public availability, language, construction method, associated papers, GitHub repositories, dataset links, and publishers. Datasets are further classified by domain (e.g., finance, medical, code) and construction methodology (human-generated, model-constructed, or collection/improvement of existing data). The survey also includes detailed metadata for each dataset, such as size, license, and specific task categories.
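As a rough sketch of the per-dataset metadata the survey tracks, the record below mirrors the fields listed above. The dataclass, field names, and example values are assumptions made for illustration; they are not part of the repository, which stores this information in markdown tables rather than code.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetEntry:
    """Hypothetical record mirroring the metadata fields described by the survey."""
    name: str
    release_time: str                  # e.g. "2023-04"
    publicly_available: bool
    language: str                      # e.g. "EN", "ZH", "Multi"
    construction_method: str           # "human-generated" | "model-constructed" | "collection/improvement"
    domain: str                        # e.g. "general", "finance", "medical", "code"
    size: Optional[str] = None         # e.g. "52K instances", "825 GiB"
    license: Optional[str] = None      # e.g. "Apache-2.0", "CC-BY-SA"
    paper_url: Optional[str] = None
    github_url: Optional[str] = None
    dataset_url: Optional[str] = None
    publisher: Optional[str] = None
    task_categories: list[str] = field(default_factory=list)

# Illustrative entry with made-up values (not taken from the catalog):
example_entry = DatasetEntry(
    name="ExampleInstructionSet",
    release_time="2023-03",
    publicly_available=True,
    language="EN",
    construction_method="model-constructed",
    domain="general",
    size="52K instances",
    license="Apache-2.0",
    task_categories=["instruction following"],
)
```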
Quick Start & Requirements
This repository is a curated list and does not require installation or execution. It serves as a reference guide.
Highlighted Details
- Spans five primary dimensions (pre-training corpora, instruction fine-tuning, preference, evaluation, and traditional NLP datasets) plus emerging multi-modal and RAG categories.
- Records per-dataset metadata: release time, public availability, language, size, license, construction method, associated paper, GitHub repository, dataset link, and publisher.
- Classifies datasets by domain (e.g., finance, medical, code) and by construction method (human-generated, model-constructed, or collected/improved from existing data).
Maintenance & Community
The project is actively maintained, with recent updates noted in the changelog. Contributions are welcomed via pull requests for new datasets or corrections. Contact information for maintainers is provided.
Licensing & Compatibility
Dataset licenses vary widely, including Apache-2.0, MIT, CC-BY-SA, and proprietary licenses. Users must consult individual dataset licenses for compatibility with commercial or closed-source applications.
Limitations & Caveats
Given the sheer volume of LLM datasets, this catalog may not be exhaustive. The project relies on community contributions and publicly available information, and dataset availability or licensing may change over time. Future updates will prioritize the most essential dataset details to keep the catalog manageable.