awesome-instruction-dataset  by yaodongC

Dataset collection for instruction-tuning LLMs

created 2 years ago
1,127 stars

Top 34.7% on sourcepulse

Project Summary

This repository is a curated collection of open-source datasets for training instruction-following Large Language Models (LLMs), both text-only and multimodal. It gives researchers and developers a centralized, categorized list of resources for fine-tuning open models such as LLaMA and Alpaca into ChatGPT-style assistants.

How It Works

The collection categorizes datasets by modality (text, visual-instruction-tuning), language (English, Chinese, multilingual), task focus (multi-task, task-specific), and generation method (human-generated, self-instruct, mixed, collection). These four axes let users narrow the list to datasets that fit a given training goal: general instruction following, a specialized task, or multimodal capability.
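The four categorization axes above can be sketched as a small record type with a filter over it. This is an illustrative sketch only: the repository is a plain list and ships no such code, and `DatasetEntry`, `select`, `CATALOG`, and the `example-*` entries are hypothetical names and placeholder data, not real items from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One hypothetical row of the list, tagged along the four axes."""
    name: str
    modality: str   # "text" or "visual-instruction-tuning"
    language: str   # "english", "chinese", or "multilingual"
    task: str       # "multi-task" or "task-specific"
    method: str     # "human-generated", "self-instruct", "mixed", or "collection"


def select(entries, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        e for e in entries
        if all(getattr(e, field) == value for field, value in criteria.items())
    ]


# Placeholder entries for illustration only; not real datasets from the list.
CATALOG = [
    DatasetEntry("example-a", "text", "english", "multi-task", "self-instruct"),
    DatasetEntry("example-b", "text", "chinese", "task-specific", "human-generated"),
    DatasetEntry("example-c", "visual-instruction-tuning", "english", "multi-task", "mixed"),
]

english_multitask = select(CATALOG, language="english", task="multi-task")
print([e.name for e in english_multitask])
```

Passing no criteria returns every entry, and each extra keyword narrows the result, which mirrors how a reader would scan the list by category.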

Quick Start & Requirements

This repository is a curated list and does not require installation or execution. Users are directed to individual linked GitHub repositories for dataset access and usage.

Highlighted Details

  • Comprehensive coverage of instruction tuning and RLHF datasets.
  • Includes both text-based and multimodal (image-text) instruction datasets.
  • Categorization by language, task, and generation method aids selection.
  • Provides dataset size, generation model, and licensing information for each entry.

Maintenance & Community

The repository is community-driven, with contributions from various researchers and organizations. It acts as a central index, linking to numerous active projects.

Licensing & Compatibility

Datasets are released under various licenses. Permissive licenses such as Apache 2.0, MIT, and CC BY 4.0 are generally compatible with commercial use. Other datasets are licensed under CC BY-NC 4.0, which restricts use to non-commercial purposes, or GPL-3.0, which carries copyleft obligations. Users must consult the specific license for each linked dataset.
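The license grouping described above could be encoded as a simple lookup when screening datasets for a project. This is a minimal sketch, assuming license names are normalized to SPDX-style identifiers; `license_category` and the set names are hypothetical, and the result is a first-pass triage rather than legal advice.

```python
# Groupings follow the licenses named in this section; identifiers are
# assumed to be SPDX-style strings.
PERMISSIVE = {"Apache-2.0", "MIT", "CC-BY-4.0"}
NON_COMMERCIAL = {"CC-BY-NC-4.0"}
COPYLEFT = {"GPL-3.0"}


def license_category(license_id: str) -> str:
    """Classify a license identifier into a coarse usage category."""
    if license_id in PERMISSIVE:
        return "permissive"
    if license_id in NON_COMMERCIAL:
        return "non-commercial"
    if license_id in COPYLEFT:
        return "copyleft"
    # Anything not listed needs a manual read of the license text.
    return "unknown"


for lic in ["MIT", "CC-BY-NC-4.0", "GPL-3.0", "custom"]:
    print(lic, "->", license_category(lic))
```

Returning "unknown" for unlisted identifiers, instead of guessing, keeps the lookup conservative: such datasets still require reading the license of the linked project.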

Limitations & Caveats

The repository is a curated list, so the quality, size, and licensing of each dataset depend on the individual linked project. Some datasets were generated with a specific model (e.g., GPT-4), which can carry that model's biases or limitations into the data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

11 stars in the last 90 days
