Awesome-LLMs-Datasets  by lmmlzn

LLM datasets survey for pre-training, fine-tuning, preference, evaluation, and NLP

created 1 year ago
1,326 stars

Top 30.9% on sourcepulse

Project Summary

This repository serves as a comprehensive survey and catalog of datasets used for Large Language Models (LLMs), categorized across five primary dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, and Traditional NLP Datasets. It also includes emerging categories for Multi-modal LLMs and Retrieval Augmented Generation (RAG) datasets, aiming to provide a structured overview for researchers and practitioners in the LLM field.

How It Works

The project meticulously collects and organizes information on LLM datasets, detailing aspects such as release time, public availability, language, construction method, associated papers, GitHub repositories, dataset links, and publishers. Datasets are further classified by domain (e.g., finance, medical, code) and construction methodology (human-generated, model-constructed, or collection/improvement of existing data). The survey also includes detailed metadata for each dataset, such as size, license, and specific task categories.
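The attributes described above can be pictured as one record per dataset. The following is a minimal sketch, not the repository's actual schema: the class name `DatasetEntry`, its field names, the `select` helper, and the three sample rows are all hypothetical and exist only to illustrate filtering by domain and construction method.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Construction(Enum):
    """The three construction methods named in the survey."""
    HUMAN_GENERATED = "human-generated"
    MODEL_CONSTRUCTED = "model-constructed"
    COLLECTION_IMPROVEMENT = "collection/improvement of existing data"


@dataclass(frozen=True)
class DatasetEntry:
    # Hypothetical field names mirroring attributes the catalog lists.
    name: str
    domain: str
    language: str
    construction: Construction
    publicly_available: bool
    license: Optional[str] = None


def select(
    entries: Iterable[DatasetEntry],
    *,
    domain: Optional[str] = None,
    construction: Optional[Construction] = None,
    public_only: bool = False,
) -> List[DatasetEntry]:
    """Return the entries that match every filter that was supplied."""
    matches = []
    for entry in entries:
        if domain is not None and entry.domain != domain:
            continue
        if construction is not None and entry.construction is not construction:
            continue
        if public_only and not entry.publicly_available:
            continue
        matches.append(entry)
    return matches


# Placeholder records for illustration only, not real catalog rows.
CATALOG = [
    DatasetEntry("example-finance-qa", "finance", "EN",
                 Construction.HUMAN_GENERATED, True, "MIT"),
    DatasetEntry("example-medical-dialogue", "medical", "ZH",
                 Construction.MODEL_CONSTRUCTED, False),
    DatasetEntry("example-code-instructions", "code", "EN",
                 Construction.MODEL_CONSTRUCTED, True, "Apache-2.0"),
]

public_model_built = select(
    CATALOG, construction=Construction.MODEL_CONSTRUCTED, public_only=True
)
print([entry.name for entry in public_model_built])
```

Keeping the construction method as an enum rather than free text means a typo in a filter fails loudly instead of silently matching nothing.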

Quick Start & Requirements

This repository is a curated list and does not require installation or execution. It serves as a reference guide.

Highlighted Details

  • Extensive Coverage: Covers 444 datasets spanning 8 language categories and 32 domains.
  • Detailed Metadata: Information is provided across 20 dimensions. The surveyed data totals more than 774.5 TB for pre-training corpora and more than 700M instances for the other dataset types.
  • Regular Updates: New sections for Multi-modal LLMs and RAG datasets have been added, with plans for continued updates.
  • Comprehensive Survey Paper: A linked survey paper provides a deeper analysis of LLM datasets, their challenges, and future trends.

Maintenance & Community

The project is actively maintained, with recent updates noted in the changelog. Contributions are welcomed via pull requests for new datasets or corrections. Contact information for maintainers is provided.

Licensing & Compatibility

Dataset licenses vary widely, including Apache-2.0, MIT, CC-BY-SA, and proprietary licenses. Users must consult individual dataset licenses for compatibility with commercial or closed-source applications.
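Because licenses differ per dataset, a project that draws on the catalog may want to flag entries for manual review before use. The sketch below assumes a hypothetical allowlist (`ALLOWED_WITHOUT_REVIEW`) and helper name (`needs_license_review`); which licenses belong on the allowlist is a decision for each project, and the license strings are compared as plain text rather than validated identifiers.

```python
from typing import FrozenSet, Optional

# Hypothetical allowlist: licenses this imagined project accepts without review.
ALLOWED_WITHOUT_REVIEW: FrozenSet[str] = frozenset({"apache-2.0", "mit"})


def needs_license_review(
    license_name: Optional[str],
    allowed: FrozenSet[str] = ALLOWED_WITHOUT_REVIEW,
) -> bool:
    """Return True when a dataset's license should be checked by a person.

    A missing license is always flagged, since absence of a license
    does not imply permission to use the data.
    """
    if license_name is None:
        return True
    # Normalize case and surrounding whitespace before comparing.
    return license_name.strip().lower() not in allowed


for name in ["MIT", " Apache-2.0 ", "CC-BY-SA", None]:
    print(repr(name), needs_license_review(name))
```

Flagging by default, and allowing only what is explicitly listed, keeps unknown or proprietary licenses from slipping through unnoticed.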

Limitations & Caveats

The sheer volume of LLM datasets means that this catalog may not be exhaustive. The project relies on community contributions and publicly available information, and dataset availability or licensing may change. Future updates will focus on key details to maintain efficiency.

Health Check

Last commit: 4 months ago
Responsiveness: 1 week
Pull Requests (30d): 0
Issues (30d): 0

Star History

70 stars in the last 90 days
