InstructionZoo by FreedomIntelligence

Instruction-tuning dataset collection for chat-based LLMs

created 2 years ago
280 stars

Top 93.9% on sourcepulse

Project Summary

InstructionZoo is a comprehensive, ongoing collection of open-source instruction-tuning datasets designed to train chat-based Large Language Models (LLMs) like ChatGPT, LLaMA, and Alpaca. It serves researchers and developers working on LLM alignment and instruction following, offering a diverse range of English, Chinese, and multilingual datasets for various NLP tasks.

How It Works

The project curates and categorizes numerous instruction-tuning datasets, providing detailed metadata for each. This includes dataset size, language, summary, generation method, associated papers, HuggingFace links, and licensing information. The collection aims to standardize and simplify access to these valuable resources for LLM training and evaluation.
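As a rough illustration, the metadata fields listed above could be modeled as a simple record. This is a hypothetical sketch, not the repository's actual schema: the class name `DatasetEntry`, the field names, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetEntry:
    """Hypothetical record mirroring the metadata fields the README lists."""

    name: str
    size: Optional[int]            # number of examples; None if not reported
    language: str
    summary: str
    generation_method: str
    papers: List[str] = field(default_factory=list)
    huggingface_link: Optional[str] = None
    license: Optional[str] = None


# Placeholder values for illustration only; not taken from the repository.
example = DatasetEntry(
    name="example-dataset",
    size=None,
    language="English",
    summary="Placeholder summary.",
    generation_method="self-instruct",
    license="MIT License",
)
```

Keeping optional fields such as the HuggingFace link and license as `None` by default reflects that not every entry in a collection like this is guaranteed to report them.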

Highlighted Details

  • Breadth: Encompasses over 30 distinct datasets, including foundational ones like Alpaca, FLAN, and Super-Natural-Instructions, alongside specialized Chinese datasets (BELLE, FlagInstruct) and multilingual options (xP3).
  • Task Diversity: Covers a wide array of NLP tasks, from general instruction following and question answering to code generation, reasoning (Chain-of-Thoughts), and human value alignment.
  • Generation Methods: Details the various techniques used to create these datasets, including self-instruct, human annotation, translation, and leveraging LLMs like ChatGPT and GPT-4.
  • Data Quality: Includes datasets specifically curated to address issues like hallucinations and inconsistent outputs (e.g., Cleaned Alpaca).

Maintenance & Community

The README describes this as an ongoing project with continuous updates planned. FreedomIntelligence is listed as the primary contributor. The README does not mention community channels such as Discord or Slack.

Licensing & Compatibility

  • Licenses: Vary by dataset and include CC BY NC 4.0, Apache License, MIT License, and others.
  • Compatibility: The "NC" (Non-Commercial) clauses in several popular datasets (e.g., Alpaca, Cleaned Alpaca) restrict their use in commercial products. Users must carefully check the license for each individual dataset.
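The per-dataset license check could be automated with a first-pass filter along these lines. This is a minimal sketch under stated assumptions: the function names, the string-matching heuristic, and the placeholder dataset names are hypothetical, and a match (or non-match) is no substitute for reading each dataset's actual license text.

```python
from typing import Dict, List


def is_non_commercial(license_name: str) -> bool:
    """Heuristic: flag a license whose name carries an 'NC' token
    or spells out 'non-commercial' in some form."""
    upper = license_name.upper()
    tokens = upper.replace("-", " ").split()
    compact = upper.replace("-", "").replace(" ", "")
    return "NC" in tokens or "NONCOMMERCIAL" in compact


def commercially_usable(licenses: Dict[str, str]) -> List[str]:
    """Return dataset names whose license name is not flagged as non-commercial."""
    return sorted(
        name for name, lic in licenses.items() if not is_non_commercial(lic)
    )


# Placeholder dataset names; the license strings follow the ones named above.
sample = {
    "dataset-a": "CC BY NC 4.0",
    "dataset-b": "Apache License",
    "dataset-c": "MIT License",
}
candidates = commercially_usable(sample)  # ['dataset-b', 'dataset-c']
```

The filter only narrows the candidate list; anything it passes still needs a manual license review before commercial use.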

Limitations & Caveats

The project is described as "on-going," so its contents may change. Some dataset entries are placeholders marked "Empty for now. Soon to update." The varied licensing, particularly the non-commercial clauses on several datasets, is a significant constraint for commercial adoption.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
