sft_datasets  by chaoswork

SFT datasets for instruction tuning

created 2 years ago
530 stars

Top 60.5% on sourcepulse

Project Summary

This repository catalogs open-source datasets for Supervised Fine-Tuning (SFT) of Large Language Models (LLMs). It is aimed at researchers and developers working on LLM alignment and instruction tuning, and gives them a single place to find datasets across tasks and languages.

How It Works

The project is a catalog. For each SFT dataset it lists the size, language, task type (e.g., Machine Translation, Text Summarization, Instruction Following), generation method (e.g., Self-Instruct, Collection, Distillation), and data source, so that readers can find and obtain training data that fits their needs.
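As a rough illustration of how such a catalog could be queried, the sketch below models one entry with the fields named above and filters a list of entries. The `CatalogEntry` class, the `filter_entries` helper, and every entry in `CATALOG` are hypothetical placeholders; they are not taken from the repository, which is a human-readable list rather than a Python API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a dataset catalog (hypothetical schema, not the repo's)."""
    name: str
    size: int          # number of examples
    language: str      # e.g. "CN", "EN", "Multi"
    task: str          # e.g. "Instruction Following"
    gen_method: str    # e.g. "Self-Instruct", "Collection", "Distillation"
    source: str        # where the data came from
    url: str           # download link


def filter_entries(
    entries: Iterable[CatalogEntry],
    language: Optional[str] = None,
    task: Optional[str] = None,
    gen_method: Optional[str] = None,
) -> List[CatalogEntry]:
    """Return entries matching every criterion that is not None."""
    result = []
    for entry in entries:
        if language is not None and entry.language != language:
            continue
        if task is not None and entry.task != task:
            continue
        if gen_method is not None and entry.gen_method != gen_method:
            continue
        result.append(entry)
    return result


# Placeholder entries for illustration only; names, sizes, and URLs are made up.
CATALOG = [
    CatalogEntry("example-cn-dialogue", 1000, "CN", "Dialogue",
                 "Self-Instruct", "placeholder", "https://example.com/cn-dialogue"),
    CatalogEntry("example-en-instruct", 500, "EN", "Instruction Following",
                 "Distillation", "placeholder", "https://example.com/en-instruct"),
    CatalogEntry("example-cn-translate", 200, "CN", "Machine Translation",
                 "Collection", "placeholder", "https://example.com/cn-translate"),
]

if __name__ == "__main__":
    for entry in filter_entries(CATALOG, language="CN"):
        print(entry.name, entry.task, entry.url)
```

Keeping the criteria optional lets a single helper answer both broad queries (all CN datasets) and narrow ones (CN datasets built by Collection).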

Quick Start & Requirements

  • Datasets are accessed via provided download links.
  • No specific software installation is required to browse or download the dataset metadata. Actual dataset usage will depend on the specific LLM training framework.
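What happens after download varies by dataset and framework. As a minimal sketch, assume a downloaded file is in JSON Lines form with Alpaca-style `instruction`, `input`, and `output` fields; many listed datasets may use a different layout, so check each one. The helper names `read_jsonl` and `to_prompt_response` are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Parse JSON Lines input, skipping blank lines.

    Accepts any iterable of strings, e.g. an open text file.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_prompt_response(record: Dict[str, str]) -> Dict[str, str]:
    """Turn an Alpaca-style record into a prompt/response pair.

    Assumes the keys "instruction", optional "input", and "output".
    """
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    prompt = instruction if not context else f"{instruction}\n\n{context}"
    return {"prompt": prompt, "response": record["output"].strip()}


if __name__ == "__main__":
    # Inline sample standing in for a downloaded file (made-up content).
    sample = [
        '{"instruction": "Translate to English.", "input": "你好", "output": "Hello"}',
        "",
        '{"instruction": "Name a primary color.", "input": "", "output": "Red"}',
    ]
    for record in read_jsonl(sample):
        print(to_prompt_response(record))
```

With a real file, `read_jsonl(open(path, encoding="utf-8"))` would stream records one at a time, which avoids loading a large dataset into memory.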

Highlighted Details

  • Extensive coverage of Chinese (CN) datasets, including Belle, Firefly, GAOKAO, and COIG.
  • Includes English (EN) and mixed-language datasets like Alpaca, Dolly 2.0, and ShareChat.
  • Covers a broad spectrum of NLP tasks, from general instruction following and dialogue to specialized areas like code generation and financial Q&A.
  • Details data generation methods, including human annotation, distillation from powerful models (GPT-3, GPT-4), and self-instruct techniques.

Maintenance & Community

  • The project is community-driven, with a call for contributions to expand the dataset list.
  • A download link is provided for each dataset.

Licensing & Compatibility

  • Dataset licenses vary by source; users must verify individual dataset licenses for compatibility with their intended use, especially for commercial applications.

Limitations & Caveats

  • The repository itself is a catalog; it does not host the datasets directly. Users must follow individual links to download.
  • Dataset quality and suitability for specific LLM training tasks depend on the original source and generation methodology.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 27 stars in the last 90 days
