awesome-instruction-datasets by jianzhnie

Curated list of instruction datasets for training ChatLLMs

Created 2 years ago

717 stars

Top 48.0% on SourcePulse

1 Expert Loves This Project

Pawel Garbacki

Cofounder of Fireworks AI

Project Summary

This repository is a curated list of instruction-following datasets for training large language models (LLMs), particularly chat models in the style of ChatGPT. It aims to speed up research and development in Natural Language Processing (NLP) by collecting resources for instruction tuning and Reinforcement Learning from Human Feedback (RLHF) in one place. The target audience is NLP researchers and developers working on LLM alignment and performance.

How It Works

The project categorizes and lists numerous instruction datasets, often detailing their source, generation method (human-generated, self-instruct, collection), language(s), task types (multi-task, task-specific), and instance counts. It also includes a separate section for RLHF datasets, highlighting human preference data crucial for aligning models with human values. The organization facilitates comparison and selection of datasets based on specific project needs.
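The metadata fields described above lend themselves to a simple filter. The following is a minimal sketch, not something the repository ships: the `DatasetEntry` class, the `select` helper, and every entry in `CATALOG` are hypothetical placeholders invented for illustration, not real datasets or real instance counts.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """One row of a dataset list (hypothetical schema mirroring the listed fields)."""
    name: str
    generation_method: str        # "human-generated", "self-instruct", or "collection"
    languages: Tuple[str, ...]    # language codes, e.g. ("en", "zh")
    task_type: str                # "multi-task" or "task-specific"
    instances: int                # number of examples in the dataset


def select(
    entries: Iterable[DatasetEntry],
    language: Optional[str] = None,
    generation_method: Optional[str] = None,
    min_instances: int = 0,
) -> List[DatasetEntry]:
    """Return the entries that satisfy every filter that was supplied."""
    result = []
    for entry in entries:
        if language is not None and language not in entry.languages:
            continue
        if generation_method is not None and entry.generation_method != generation_method:
            continue
        if entry.instances < min_instances:
            continue
        result.append(entry)
    return result


# Placeholder entries: the names and counts are made up for this sketch.
CATALOG = [
    DatasetEntry("example-en-self-instruct", "self-instruct", ("en",), "multi-task", 52_000),
    DatasetEntry("example-zh-human", "human-generated", ("zh",), "task-specific", 8_000),
    DatasetEntry("example-bilingual-collection", "collection", ("en", "zh"), "multi-task", 1_200_000),
]

if __name__ == "__main__":
    for entry in select(CATALOG, language="zh", min_instances=10_000):
        print(entry.name, entry.instances)
```

Keeping the filters optional means the same helper covers narrow queries (one language, one generation method) and broad ones (anything above a size threshold).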

Quick Start & Requirements

Datasets are typically accessed via Hugging Face Hub links or direct GitHub repository downloads.
No installation command is provided because the repository is a curated list, not a runnable tool.
Requirements depend on the individual datasets, which may include Python environments and specific NLP libraries.
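For entries hosted on the Hugging Face Hub, access might look like the sketch below. It assumes the `datasets` library is installed (`pip install datasets`) and that the entry exposes a `train` split; the helper names `load_instruction_split` and `preview` are hypothetical, and the repository ID in the usage note is a placeholder for whichever ID a list entry links to.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def load_instruction_split(repo_id: str, split: str = "train") -> Any:
    """Download one split of a Hub-hosted dataset (requires network access)."""
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


def preview(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records as plain dicts for a quick look at the schema."""
    return [dict(record) for record in islice(records, n)]
```

A typical call would be `preview(load_instruction_split("org/dataset-name"))`, with `org/dataset-name` replaced by a real ID. Column names differ between datasets, so inspecting a few records first is a cheap way to learn the schema before writing any training code.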

Highlighted Details

Extensive coverage of both English and Chinese instruction datasets.
Detailed statistics and comparisons of various datasets, including generation methods and costs.
Inclusion of RLHF datasets, crucial for safety and helpfulness alignment.
Categorization by task type (e.g., general instruction, code, medical) and generation method.

Maintenance & Community

The repository carries an "Awesome" badge, marking it as a list in the community-curated "awesome list" style.
A "Contributing" section invites community participation.
Links to related repositories and resources are provided for further exploration.

Licensing & Compatibility

The repository itself is released under the Apache 2.0 license.
Individual datasets in the collection carry their own licenses, which users must check for compatibility, especially before commercial use.

Limitations & Caveats

Some datasets were generated with paid APIs (e.g., GPT-3.5/4), so reproducing or extending them may incur costs, and some do not state a license explicitly; both cases call for due diligence by the user. The project is a living collection, so dataset availability and specific details may change.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

3 stars in the last 30 days