Dataset collection for instruction-tuning LLMs
This repository is a curated collection of open-source datasets for training instruction-following Large Language Models (LLMs), covering both text-only and multimodal models. It serves researchers and developers by providing a centralized, categorized index of resources for fine-tuning base models such as LLaMA into ChatGPT-style assistants (Alpaca being a well-known example), facilitating the development of more capable and aligned AI systems.
How It Works
The collection categorizes datasets by modality (text, visual-instruction-tuning), language (English, Chinese, multilingual), task focus (multi-task, task-specific), and generation method (human-generated, self-instruct, mixed, collection). This structured approach allows users to efficiently find datasets tailored to their specific LLM training needs, whether for general instruction following, specialized tasks, or multimodal capabilities.
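To make the taxonomy concrete, the sketch below shows one way the index could be represented and filtered programmatically. This is purely illustrative: the repository itself is a markdown list, so the `DatasetEntry` schema, the field values, and the sample entries are assumptions, not part of the project.

```python
from dataclasses import dataclass

# Hypothetical machine-readable view of one entry in the curated index.
@dataclass
class DatasetEntry:
    name: str
    modality: str          # "text" or "visual-instruction-tuning"
    languages: list[str]   # e.g. ["en"], ["zh"], or ["multi"]
    task_focus: str        # "multi-task" or "task-specific"
    generation: str        # "human-generated", "self-instruct", "mixed", "collection"

# Two sample entries; attributes reflect how such datasets are commonly described.
index = [
    DatasetEntry("databricks-dolly-15k", "text", ["en"], "multi-task", "human-generated"),
    DatasetEntry("alpaca_data", "text", ["en"], "multi-task", "self-instruct"),
]

# Example query: English, text-only datasets written entirely by humans.
human_english = [
    d for d in index
    if d.modality == "text" and "en" in d.languages and d.generation == "human-generated"
]
print([d.name for d in human_english])  # ['databricks-dolly-15k']
```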
Quick Start & Requirements
This repository is a curated list and does not require installation or execution. Users are directed to individual linked GitHub repositories for dataset access and usage.
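As a minimal sketch of what "usage" looks like for one commonly linked dataset: the snippet below loads the Stanford Alpaca data, assuming the Hugging Face `datasets` library is installed (`pip install datasets`) and that the data is available on the Hub as `tatsu-lab/alpaca`. Always check the linked upstream repository for the canonical source and field names.

```python
from datasets import load_dataset

# Load the instruction-tuning split; the dataset ID is an assumption,
# verify it against the linked project before relying on it.
ds = load_dataset("tatsu-lab/alpaca", split="train")

example = ds[0]
# Typical instruction-tuning fields: instruction, input (optional context), output.
print(example["instruction"])
print(example["output"])
```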
Maintenance & Community
The repository is community-driven, with contributions from various researchers and organizations. It acts as a central index, linking to numerous active projects.
Licensing & Compatibility
Datasets are released under various licenses, including permissive ones like Apache 2.0, MIT, and CC BY 4.0, which are generally compatible with commercial use. However, some datasets are licensed under CC BY-NC 4.0 or GPL-3.0, which may impose non-commercial or copyleft restrictions. Users must consult the specific license for each linked dataset.
Limitations & Caveats
The repository is a curated list, so the quality, size, and licensing details of each entry depend on the individual linked projects. Some datasets were generated with specific models (e.g., GPT-4), which can carry the generating model's biases and limitations into any model fine-tuned on them.