LLMDataHub  by Zjh-819

Dataset list for LLM training

created 2 years ago
3,198 stars

Top 15.4% on sourcepulse

Project Summary

This repository serves as a curated hub for Large Language Model (LLM) training datasets, focusing on instruction finetuning and pretraining corpora. It aims to consolidate scattered open-source datasets, making them accessible to researchers and practitioners looking to train or improve LLMs, particularly chatbots.

How It Works

The project organizes datasets by type (Alignment, Domain-specific, Pretraining, Multimodal) and release date, providing key metadata such as dataset name, usage, type (SFT, Dialog, Pairs, PT, RLHF, CoT), language, size, and a brief description. This structured approach facilitates efficient discovery and selection of relevant training data.
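The metadata layout described above can be pictured as one record per dataset. The following is a minimal sketch, not the repository's own format (the repository is a list of links, not code): the `DatasetEntry` class, the `select` helper, and every entry in `ENTRIES` are hypothetical placeholders made up for illustration. Only the category names and type tags come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the list (hypothetical structure, for illustration only)."""
    name: str
    category: str            # "Alignment", "Domain-specific", "Pretraining", "Multimodal"
    types: Tuple[str, ...]   # any of "SFT", "Dialog", "Pairs", "PT", "RLHF", "CoT"
    language: str
    size: str
    description: str


def select(entries: List[DatasetEntry],
           category: Optional[str] = None,
           type_tag: Optional[str] = None,
           language: Optional[str] = None) -> List[DatasetEntry]:
    """Return entries matching every filter that is not None."""
    matches = []
    for entry in entries:
        if category is not None and entry.category != category:
            continue
        if type_tag is not None and type_tag not in entry.types:
            continue
        if language is not None and entry.language != language:
            continue
        matches.append(entry)
    return matches


# Placeholder rows: these are NOT real datasets from the repository.
ENTRIES = [
    DatasetEntry("placeholder-sft-en", "Alignment", ("SFT",),
                 "English", "unknown", "hypothetical instruction data"),
    DatasetEntry("placeholder-pairs-zh", "Alignment", ("Pairs", "RLHF"),
                 "Chinese", "unknown", "hypothetical preference pairs"),
    DatasetEntry("placeholder-corpus", "Pretraining", ("PT",),
                 "Multilingual", "unknown", "hypothetical raw-text corpus"),
]

if __name__ == "__main__":
    for entry in select(ENTRIES, category="Alignment", language="English"):
        print(entry.name, entry.types)
```

Filtering on a type tag rather than on the category alone reflects that a single dataset can carry several tags (for example both Pairs and RLHF), which is why `types` is a tuple in this sketch.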

Quick Start & Requirements

This repository is a curated list of links and metadata; it does not require installation or execution. Users are directed to the original sources for dataset downloads and usage.

Highlighted Details

  • Comprehensive categorization of datasets by training objective (e.g., Supervised Finetune, Reinforcement Learning from Human Feedback, Pretraining).
  • Inclusion of datasets specifically for improving LLM capabilities in areas like STEM reasoning, coding, and long-context understanding.
  • Coverage of both English and Chinese language datasets, as well as multilingual options.
  • Metadata covers dataset size, language, and the specific use cases or models each dataset was used with.

Maintenance & Community

The project is maintained by Zjh-819 and advised by Prof. Wanyun Cui. Contributions are welcome; prospective contributors should contact the maintainer.

Licensing & Compatibility

Licensing varies by original source, so users must consult the terms of each individual dataset. Whether a dataset can be used commercially depends on its own license.

Limitations & Caveats

The repository is a curated list and does not host the datasets themselves. Users must navigate to external links for access, and some datasets may have specific usage restrictions or require significant processing. Some entries have notes like "⚠️use with care" or "⚠️RISKY," indicating potential issues with data quality or ethical considerations.

Health Check

Last commit: 1 year ago
Responsiveness: 1 week
Pull Requests (30d): 0
Issues (30d): 0

Star History

183 stars in the last 90 days
