instruction-datasets by raunak-agarwal

Dataset list for instruction tuning of LLMs

Created 3 years ago

261 stars

Top 97.2% on SourcePulse

3 Experts Love This Project

Binyuan Hui

Research Scientist at Alibaba Qwen

Luca Soldaini

Research Scientist at Ai2

Andreas Jansson

Cofounder of Replicate

Project Summary

This repository is a curated list of datasets for instruction tuning of Large Language Models (LLMs). It targets researchers and developers who want to improve how well LLMs follow instructions across a wide range of NLP tasks, and it gives them a single place to find diverse training data.

How It Works

The project aggregates and links to numerous publicly available instruction-following datasets, categorized by their origin and characteristics. It includes datasets generated through human annotation, self-instruct methods, and large-scale model-based generation, covering diverse tasks, languages, and modalities. This approach offers a broad spectrum of data for robust LLM training and evaluation.

Quick Start & Requirements

Datasets are accessed primarily through the Hugging Face datasets library or through direct GitHub repository links.
Browsing the list requires no installation; loading a dataset with the Hugging Face datasets library requires running pip install datasets first.
Beyond that, dependencies are standard Python libraries and Hugging Face ecosystem components.
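As a rough illustration of the Hugging Face route, the sketch below wraps datasets.load_dataset in a small helper and previews a few records. The helper names load_split and preview are hypothetical, and the Hub identifier in the usage comment (Anthropic/hh-rlhf) is an assumption about where HH-RLHF is published; check each dataset's own link in the list for the actual identifier.

```python
from itertools import islice
from typing import Any, Iterable, List


def load_split(repo_id: str, split: str = "train", streaming: bool = True) -> Any:
    """Load one split of a Hub dataset (hypothetical helper).

    Requires `pip install datasets` and network access.
    """
    # Imported lazily so this module can be imported without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading the whole dataset.
    return load_dataset(repo_id, split=split, streaming=streaming)


def preview(records: Iterable[dict], n: int = 3) -> List[dict]:
    """Return the first n records from any iterable of records."""
    return list(islice(records, n))


# Usage (assumed Hub ID; not run here because it needs network access):
#   rows = preview(load_split("Anthropic/hh-rlhf"), n=2)
#   print(rows[0].keys())
```

Streaming is used here because, as noted under Limitations & Caveats, some datasets have large download sizes; pass streaming=False to download and cache the full split instead.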

Highlighted Details

Extensive coverage of instruction tuning datasets, including P3, Natural Instructions v2, FLAN, Open Assistant, LIMA, and many more.
Inclusion of multi-modal instruction datasets like LLaVA Visual Instruct 150K and MIMIC-IT.
Datasets for preference learning and reward model training, such as HH-RLHF and OpenAI WebGPT comparisons.
Links to associated research papers and GitHub repositories for deeper dives into dataset creation and methodology.
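To make the preference-learning entries more concrete, here is a minimal sketch of turning one comparison record into a prompt plus a preferred and a dispreferred response. It assumes the record layout commonly seen for HH-RLHF: two text fields, chosen and rejected, that share the same dialogue up to the final "Assistant:" turn. The function name split_preference_pair and the sample record are hypothetical; other preference datasets, such as the WebGPT comparisons, use different schemas.

```python
def split_preference_pair(record: dict) -> dict:
    """Split a chosen/rejected record into prompt, chosen, rejected (illustrative).

    Assumes both texts share a common prefix ending at the last
    "\n\nAssistant:" marker of the chosen text.
    """
    marker = "\n\nAssistant:"
    chosen, rejected = record["chosen"], record["rejected"]

    # Locate the final assistant turn in the preferred transcript.
    idx = chosen.rfind(marker)
    if idx == -1:
        raise ValueError("no assistant turn found in 'chosen'")

    prompt = chosen[: idx + len(marker)]
    if not rejected.startswith(prompt):
        raise ValueError("'chosen' and 'rejected' do not share the same prompt")

    return {
        "prompt": prompt,
        "chosen": chosen[len(prompt):].strip(),
        "rejected": rejected[len(prompt):].strip(),
    }


# Hypothetical record, written by hand for illustration only.
example = {
    "chosen": "\n\nHuman: Hi\n\nAssistant: Hello there!",
    "rejected": "\n\nHuman: Hi\n\nAssistant: Go away.",
}
pair = split_preference_pair(example)
```

Raising on a mismatched prefix, rather than guessing, keeps malformed records out of reward-model training data; whether to skip or repair such records is left to the user.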

Maintenance & Community

The repository is maintained by raunak-agarwal.
Links to Hugging Face, GitHub, and associated research papers are provided for each dataset, facilitating community engagement and further exploration.

Licensing & Compatibility

Licenses vary per dataset, with many available under permissive licenses (e.g., Apache 2.0, MIT) or specific academic use terms.
Users must consult the individual dataset licenses for commercial use or closed-source linking compatibility.

Limitations & Caveats

The repository itself does not host the datasets but provides links. Users are responsible for adhering to the specific terms and conditions of each linked dataset. Some datasets may have large download sizes or specific access requirements.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

1 star in the last 30 days