instruction-datasets by raunak-agarwal

Dataset list for instruction tuning of LLMs

created 2 years ago
255 stars

Top 99.2% on sourcepulse

Project Summary

This repository is a curated list of datasets for instruction tuning of Large Language Models (LLMs). It is aimed at researchers and developers who want to improve how well LLMs follow instructions across a range of NLP tasks, and it gathers links to varied training data in one place.

How It Works

The project aggregates and links to numerous publicly available instruction-following datasets, categorized by their origin and characteristics. It includes datasets generated through human annotation, self-instruct methods, and large-scale model-based generation, covering diverse tasks, languages, and modalities. This approach offers a broad spectrum of data for robust LLM training and evaluation.

Quick Start & Requirements

  • Datasets are primarily accessed via Hugging Face datasets library or direct GitHub repository links.
  • Browsing the list requires no installation; loading datasets through the Hugging Face datasets library requires installing it first (pip install datasets).
  • Dependencies are standard Python libraries and Hugging Face ecosystem components.
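
As a rough illustration of the access path described above, the sketch below loads one split of a dataset from the Hugging Face Hub and turns a record into a prompt string. The helper names (load_instruction_dataset, to_prompt) are hypothetical, and the "instruction" and "input" field names are an assumption borrowed from Alpaca-style datasets; many datasets in the list use a different schema, so check the dataset card first.

```python
from typing import Any


def load_instruction_dataset(dataset_id: str, split: str = "train") -> Any:
    """Load one split of a Hub dataset (needs `pip install datasets`)."""
    # Imported lazily so the rest of this module works without the dependency.
    from datasets import load_dataset

    return load_dataset(dataset_id, split=split)


def to_prompt(record: dict,
              instruction_key: str = "instruction",
              input_key: str = "input") -> str:
    """Build a plain-text prompt from one record.

    The key names are assumptions (Alpaca-style); pass other keys
    for datasets that use a different schema.
    """
    parts = [f"Instruction: {record[instruction_key]}"]
    extra = record.get(input_key)
    if extra:  # skip a missing or empty input field
        parts.append(f"Input: {extra}")
    return "\n".join(parts)


# Example usage (downloads data; the dataset id is only an illustration):
#   ds = load_instruction_dataset("tatsu-lab/alpaca")
#   print(to_prompt(ds[0]))
```

Keeping the field names as parameters lets the same helper be reused across datasets whose columns are named differently.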

Highlighted Details

  • Extensive coverage of instruction tuning datasets, including P3, Natural Instructions v2, FLAN, Open Assistant, LIMA, and many more.
  • Inclusion of multi-modal instruction datasets like LLaVA Visual Instruct 150K and MIMIC-IT.
  • Datasets for preference learning and reward model training, such as HH-RLHF and OpenAI WebGPT comparisons.
  • Links to associated research papers and GitHub repositories for deeper dives into dataset creation and methodology.
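
For the preference datasets mentioned above, a reward-model pipeline usually needs each record as a (prompt, chosen, rejected) triple. The sketch below assumes an HH-RLHF-style record in which the "chosen" and "rejected" transcripts share everything up to the final "Assistant:" turn. That layout is an assumption about the data, and split_preference_pair is a hypothetical helper, not part of any library.

```python
ASSISTANT_TAG = "\n\nAssistant:"


def split_preference_pair(chosen: str, rejected: str) -> dict:
    """Split two transcripts into a shared prompt and two final replies.

    Assumes both transcripts are identical up to and including the
    last "Assistant:" marker of the chosen transcript.
    """
    idx = chosen.rfind(ASSISTANT_TAG)
    if idx == -1:
        raise ValueError("no assistant turn found in the chosen transcript")
    cut = idx + len(ASSISTANT_TAG)
    prompt = chosen[:cut]
    if not rejected.startswith(prompt):
        raise ValueError("transcripts do not share a common prompt")
    return {
        "prompt": prompt,
        "chosen": chosen[cut:].strip(),
        "rejected": rejected[cut:].strip(),
    }
```

Records that fail the shared-prefix check raise an error rather than being silently truncated, so they can be inspected or skipped by the caller.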

Maintenance & Community

  • The repository is maintained by raunak-agarwal.
  • Links to Hugging Face, GitHub, and associated research papers are provided for each dataset, facilitating community engagement and further exploration.

Licensing & Compatibility

  • Licenses vary per dataset, with many available under permissive licenses (e.g., Apache 2.0, MIT) or specific academic use terms.
  • Users must consult the individual dataset licenses for commercial use or closed-source linking compatibility.

Limitations & Caveats

The repository itself does not host the datasets but provides links. Users are responsible for adhering to the specific terms and conditions of each linked dataset. Some datasets may have large download sizes or specific access requirements.
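
For datasets with large downloads, one option is to stream records instead of fetching the full archive. The sketch below uses the streaming mode of the Hugging Face datasets library. The helper names stream_split and take are hypothetical, and not every linked dataset is hosted on the Hub or supports streaming.

```python
from itertools import islice
from typing import Any, Iterable, List


def stream_split(dataset_id: str, split: str = "train") -> Any:
    """Open a Hub dataset lazily (needs `pip install datasets`)."""
    # Imported lazily so `take` can be used without the dependency.
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches records on demand.
    return load_dataset(dataset_id, split=split, streaming=True)


def take(records: Iterable[dict], n: int) -> List[dict]:
    """Return the first n records without exhausting the iterable."""
    return list(islice(records, n))


# Example usage (network access; the dataset id is only an illustration):
#   sample = take(stream_split("Anthropic/hh-rlhf"), 5)
```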

Health Check
Last commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

5 stars in the last 90 days
