InstructionZoo by FreedomIntelligence

Instruction-tuning dataset collection for chat-based LLMs

Created 2 years ago

282 stars

Top 92.6% on SourcePulse

Project Summary

InstructionZoo is a comprehensive, ongoing collection of open-source instruction-tuning datasets designed to train chat-based Large Language Models (LLMs) like ChatGPT, LLaMA, and Alpaca. It serves researchers and developers working on LLM alignment and instruction following, offering a diverse range of English, Chinese, and multilingual datasets for various NLP tasks.

How It Works

The project curates and categorizes numerous instruction-tuning datasets, providing detailed metadata for each. This includes dataset size, language, summary, generation method, associated papers, HuggingFace links, and licensing information. The collection aims to standardize and simplify access to these valuable resources for LLM training and evaluation.
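The per-dataset metadata described above could be modeled as a small record type. This is a minimal sketch, not the project's actual schema: the class name, field names, and the by_language helper are hypothetical, and the catalog values are placeholders for illustration (only the Alpaca link and its non-commercial license come from this page).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog row. Field names are hypothetical, mirroring the metadata listed above."""
    name: str
    language: str
    summary: str
    generation_method: str
    license: str
    size: Optional[int] = None       # number of examples, if known
    paper_url: Optional[str] = None  # associated paper, if any
    hf_url: Optional[str] = None     # HuggingFace link, if any


def by_language(entries: List[DatasetEntry], language: str) -> List[DatasetEntry]:
    """Return entries whose language matches, ignoring case and surrounding spaces."""
    wanted = language.strip().lower()
    return [e for e in entries if e.language.lower() == wanted]


# Illustrative catalog; summaries are placeholders, not text from the README.
CATALOG = [
    DatasetEntry(
        name="Alpaca",
        language="English",
        summary="Placeholder summary for illustration.",
        generation_method="self-instruct",
        license="CC BY NC 4.0",
        hf_url="https://huggingface.co/datasets/tatsu-lab/alpaca",
    ),
    DatasetEntry(
        name="example-zh-set",  # hypothetical dataset name
        language="Chinese",
        summary="Placeholder summary for illustration.",
        generation_method="translation",
        license="placeholder-license",
    ),
]

for entry in by_language(CATALOG, "english"):
    print(entry.name, entry.license)
```

Keeping size, paper_url, and hf_url optional reflects that some entries in the collection are still incomplete.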

Quick Start & Requirements

Access: Datasets are primarily accessed via HuggingFace Hub.
Requirements: Python and the Hugging Face datasets library for programmatic access. Training a model additionally requires a deep learning framework (e.g., PyTorch, TensorFlow) and suitable hardware (GPUs).
Links:
- HuggingFace Hub: https://huggingface.co/datasets/tatsu-lab/alpaca (Example for Alpaca)
- Project README: https://github.com/FreedomIntelligence/InstructionZoo
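Programmatic access through the datasets library might look like the sketch below, using the Alpaca link above as the example. The helper names and the prompt layout are assumptions made for illustration, not something the project prescribes. The loader needs the datasets package (pip install datasets) and network access, so it is defined here but not called.

```python
def format_alpaca_prompt(record: dict) -> str:
    """Turn one Alpaca-style record (instruction / input / output) into a prompt string.

    The layout is a hypothetical choice for this sketch.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    if context:
        return f"Instruction: {instruction}\nInput: {context}\nResponse:"
    return f"Instruction: {instruction}\nResponse:"


def load_alpaca_train():
    """Download the Alpaca training split from the HuggingFace Hub."""
    # Imported lazily so the formatter above works without the library installed.
    from datasets import load_dataset

    return load_dataset("tatsu-lab/alpaca", split="train")


# Offline demonstration with a hand-written record.
sample = {"instruction": "Add the two numbers.", "input": "2 and 3", "output": "5"}
print(format_alpaca_prompt(sample))
```

With the library installed, format_alpaca_prompt(load_alpaca_train()[0]) would format the first downloaded record. Other datasets in the collection may use different column names, so the formatter would need adjusting for each one.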

Highlighted Details

Breadth: Encompasses over 30 distinct datasets, including foundational ones like Alpaca, FLAN, and Super-Natural-Instructions, alongside specialized Chinese datasets (BELLE, FlagInstruct) and multilingual options (xP3).
Task Diversity: Covers a wide array of NLP tasks, from general instruction following and question answering to code generation, reasoning (Chain-of-Thought), and human value alignment.
Generation Methods: Details the various techniques used to create these datasets, including self-instruct, human annotation, translation, and leveraging LLMs like ChatGPT and GPT-4.
Data Quality: Includes datasets specifically curated to address issues like hallucinations and inconsistent outputs (e.g., Cleaned Alpaca).

Maintenance & Community

This is an ongoing project with continuous updates planned. FreedomIntelligence is listed as the primary contributor. The README does not mention community channels such as Discord or Slack.

Licensing & Compatibility

Licenses: Vary significantly by dataset, including CC BY-NC 4.0, Apache License, MIT License, and others.
Compatibility: The "NC" (Non-Commercial) clauses in several popular datasets (e.g., Alpaca, Cleaned Alpaca) restrict their use in commercial products. Users must carefully check the license for each individual dataset.
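A first-pass screen for non-commercial licenses could be sketched as follows. It is a string heuristic only, the function names are hypothetical, and the second catalog entry is made up for illustration. A name match cannot replace reading each dataset's actual license terms.

```python
import re
from typing import Dict, List


def is_non_commercial(license_name: str) -> bool:
    """Heuristic: flag license names containing an 'NC' or 'non-commercial' marker."""
    # Lowercase and collapse punctuation so "CC BY-NC 4.0" and "CC BY NC 4.0" look alike.
    normalized = re.sub(r"[^a-z0-9]+", " ", license_name.lower()).strip()
    tokens = normalized.split()
    return (
        "nc" in tokens
        or "noncommercial" in tokens
        or "non commercial" in normalized
    )


def commercially_usable(licenses: Dict[str, str]) -> List[str]:
    """Return dataset names whose license name shows no non-commercial marker."""
    return [name for name, lic in licenses.items() if not is_non_commercial(lic)]


# "Alpaca" follows the text above; "example-permissive-set" is a hypothetical entry.
licenses = {
    "Alpaca": "CC BY NC 4.0",
    "example-permissive-set": "MIT License",
}
print(commercially_usable(licenses))
```

Matching on whole tokens rather than substrings avoids false positives on words that merely contain the letters "nc".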

Limitations & Caveats

The project is described as "on-going," indicating potential for changes and additions. Some datasets are noted as "Empty for now. Soon to update." The diverse licensing, particularly the prevalence of non-commercial clauses, poses a significant constraint for commercial adoption.

Explore Similar Projects

instruction-datasets by raunak-agarwal

Awesome-LLM by MLNLP-World

WizardVicunaLM by melodysdreamj

LLM-Synthetic-Data by pengr

awesome-instruction-datasets by jianzhnie

ChatGLM-LLaMA-chinese-insturct by 27182812

InstructionWild by XueFuzhao

sft_datasets by chaoswork

vigogne by bofenghuang

awesome-instruction-dataset by yaodongC

LLMDataHub by Zjh-819

KoAlpaca by Beomi