Dataset collection for instruction-tuning LLMs
This repository is a curated collection of open-source datasets for training instruction-following Large Language Models (LLMs), covering both text-only and multimodal models. It serves researchers and developers by providing a centralized, categorized index of resources for fine-tuning base models such as LLaMA into ChatGPT-style assistants (Alpaca being a well-known example), facilitating the development of more capable and aligned AI systems.
How It Works
The collection categorizes datasets by modality (text, visual-instruction-tuning), language (English, Chinese, multilingual), task focus (multi-task, task-specific), and generation method (human-generated, self-instruct, mixed, collection). This structured approach allows users to efficiently find datasets tailored to their specific LLM training needs, whether for general instruction following, specialized tasks, or multimodal capabilities.
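To make the taxonomy concrete, the sketch below shows one way the index could be represented and filtered programmatically. This is purely illustrative: the repository itself is a markdown list, so the `DatasetEntry` schema, the field values, and the sample entries are assumptions, not part of the project.

```python
from dataclasses import dataclass

# Hypothetical machine-readable view of one entry in the curated index.
@dataclass
class DatasetEntry:
    name: str
    modality: str          # "text" or "visual-instruction-tuning"
    languages: list[str]   # e.g. ["en"], ["zh"], or ["multi"]
    task_focus: str        # "multi-task" or "task-specific"
    generation: str        # "human-generated", "self-instruct", "mixed", "collection"

# Two sample entries; attributes reflect how such datasets are commonly described.
index = [
    DatasetEntry("databricks-dolly-15k", "text", ["en"], "multi-task", "human-generated"),
    DatasetEntry("alpaca_data", "text", ["en"], "multi-task", "self-instruct"),
]

# Example query: English, text-only datasets written entirely by humans.
human_english = [
    d for d in index
    if d.modality == "text" and "en" in d.languages and d.generation == "human-generated"
]
print([d.name for d in human_english])  # ['databricks-dolly-15k']
```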
Quick Start & Requirements
This repository is a curated list and does not require installation or execution. Users are directed to individual linked GitHub repositories for dataset access and usage.
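As a minimal sketch of what "usage" looks like for one commonly linked dataset: the snippet below loads the Stanford Alpaca data, assuming the Hugging Face `datasets` library is installed (`pip install datasets`) and that the data is available on the Hub as `tatsu-lab/alpaca`. Always check the linked upstream repository for the canonical source and field names.

```python
from datasets import load_dataset

# Load the instruction-tuning split; the dataset ID is an assumption,
# verify it against the linked project before relying on it.
ds = load_dataset("tatsu-lab/alpaca", split="train")

example = ds[0]
# Typical instruction-tuning fields: instruction, input (optional context), output.
print(example["instruction"])
print(example["output"])
```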
Maintenance & Community
The repository is community-driven, with contributions from various researchers and organizations. It acts as a central index, linking to numerous active projects.
Licensing & Compatibility
Datasets are released under various licenses, including permissive ones like Apache 2.0, MIT, and CC BY 4.0, which are generally compatible with commercial use. However, some datasets are licensed under CC BY-NC 4.0 or GPL-3.0, which may impose non-commercial or copyleft restrictions. Users must consult the specific license for each linked dataset.
Limitations & Caveats
The repository is a curated list, so the quality, size, and licensing details of each entry depend on the individual linked projects. Some datasets were generated with specific models (e.g., GPT-4), which can carry the generating model's biases and limitations into any model fine-tuned on them.