Curated list of instruction datasets for training ChatLLMs
This repository serves as a curated, comprehensive list of instruction-following datasets for training large language models (LLMs), particularly chat-based models like ChatGPT. It aims to accelerate research and development in Natural Language Processing (NLP) by providing easy access to a wide array of resources for instruction tuning and Reinforcement Learning from Human Feedback (RLHF). The target audience includes NLP researchers and developers working on LLM alignment and performance.
How It Works
The project categorizes and lists numerous instruction datasets, often detailing their source, generation method (human-generated, self-instruct, collection), language(s), task types (multi-task, task-specific), and instance counts. It also includes a separate section for RLHF datasets, highlighting human preference data crucial for aligning models with human values. The organization facilitates comparison and selection of datasets based on specific project needs.
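The selection process described above can be sketched in Python. This is only an illustration under stated assumptions: the `DatasetEntry` record, its field names, the `select` helper, and the three placeholder entries are all hypothetical and do not come from the repository, which presents its list as documentation rather than as a programmatic catalog.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical record mirroring the attributes the list reports."""
    name: str
    generation_method: str        # "human-generated", "self-instruct", or "collection"
    languages: Tuple[str, ...]    # language codes, e.g. ("en",)
    task_type: str                # "multi-task" or "task-specific"
    instances: Optional[int]      # None when no instance count is listed
    is_rlhf: bool = False         # True for human preference (RLHF) data


def select(
    entries: List[DatasetEntry],
    *,
    language: Optional[str] = None,
    generation_method: Optional[str] = None,
    task_type: Optional[str] = None,
    min_instances: int = 0,
    rlhf_only: bool = False,
) -> List[DatasetEntry]:
    """Return the entries that satisfy every filter that was supplied."""
    matches = []
    for entry in entries:
        if language is not None and language not in entry.languages:
            continue
        if generation_method is not None and entry.generation_method != generation_method:
            continue
        if task_type is not None and entry.task_type != task_type:
            continue
        # An unknown instance count cannot satisfy a minimum-size requirement.
        if min_instances > 0 and (entry.instances is None or entry.instances < min_instances):
            continue
        if rlhf_only and not entry.is_rlhf:
            continue
        matches.append(entry)
    return matches


# Placeholder entries for illustration only; these are not real datasets.
CATALOG = [
    DatasetEntry("example-human-en", "human-generated", ("en",), "multi-task", 15_000),
    DatasetEntry("example-self-instruct-zh", "self-instruct", ("zh",), "task-specific", 52_000),
    DatasetEntry("example-preferences", "human-generated", ("en",), "multi-task", None, is_rlhf=True),
]

english_large = select(CATALOG, language="en", min_instances=10_000)
preference_data = select(CATALOG, rlhf_only=True)
```

Keeping the instance count optional reflects that the list gives counts often but not always; a real catalog would need to decide how to treat entries with missing metadata.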
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Some datasets may carry generation costs (e.g., from GPT-3.5/4 API usage), and some do not state a license explicitly, so users should check licensing terms themselves before use. The list is a living collection, so dataset availability and specific details may change.
1 year ago
Inactive