This repository curates a comprehensive collection of open-source datasets for Supervised Fine-Tuning (SFT) of Large Language Models (LLMs). It targets researchers and developers working on LLM alignment and instruction tuning, providing a centralized resource for diverse tasks and languages.
How It Works
The project acts as a catalog, listing numerous SFT datasets with details on size, language, task type (e.g., Machine Translation, Text Summarization, Instruction Following), generation method (e.g., Self-Instruct, Collection, Distillation), and data source, making it easier to discover and access a wide array of LLM training data.
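As a rough illustration, each catalog entry can be modeled as a small record. The schema below is a hypothetical sketch based on the fields the catalog lists; the `DatasetEntry` type and field names are illustrative, not the repo's actual format.

```python
from dataclasses import dataclass

# Hypothetical schema mirroring the catalog's per-dataset fields;
# the names here are illustrative, not the repo's actual layout.
@dataclass
class DatasetEntry:
    name: str        # e.g. "Alpaca"
    size: str        # e.g. "52K instruction-response pairs"
    language: str    # e.g. "EN", "CN", or mixed
    task_type: str   # e.g. "Instruction Following"
    generation: str  # e.g. "Self-Instruct", "Collection", "Distillation"
    source_url: str  # download link for the underlying data

alpaca = DatasetEntry(
    name="Alpaca",
    size="52K",
    language="EN",
    task_type="Instruction Following",
    generation="Self-Instruct",
    source_url="https://github.com/tatsu-lab/stanford_alpaca",
)
```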
Quick Start & Requirements
- Datasets are accessed via provided download links.
- No software installation is required to browse the catalog or download the dataset metadata; using a dataset for training depends on your chosen LLM training framework (a minimal loading sketch follows this list).
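As a minimal sketch, assuming a downloaded dataset in JSON Lines format with Alpaca-style `instruction`/`output` fields (a common convention, but each dataset's actual schema must be checked individually):

```python
import json

def load_sft_jsonl(path: str) -> list[dict]:
    """Load one training example per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Hypothetical file name; substitute the file you actually downloaded.
examples = load_sft_jsonl("dataset.jsonl")
print(examples[0]["instruction"], "->", examples[0]["output"])
```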
Highlighted Details
- Extensive coverage of Chinese (CN) datasets, including Belle, Firefly, GAOKAO, and COIG.
- Includes English (EN) and mixed-language datasets like Alpaca, Dolly 2.0, and ShareChat.
- Covers a broad spectrum of NLP tasks, from general instruction following and dialogue to specialized areas like code generation and financial Q&A.
- Details data generation methods, including human annotation, distillation from powerful models (GPT-3, GPT-4), and self-instruct techniques (a minimal self-instruct sketch follows this list).
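To make the self-instruct idea concrete, here is a hedged sketch of a single bootstrapping round. The `llm` callable is a hypothetical stand-in for whatever model API you use; real pipelines (e.g., the original Self-Instruct work) add similarity-based de-duplication and quality filtering on top of this.

```python
import random

def self_instruct_round(seed_tasks: list[str], llm, num_new: int = 5) -> list[str]:
    """One bootstrapping round: show the model a few seed instructions
    and ask it to propose new ones in the same style."""
    demos = random.sample(seed_tasks, k=min(3, len(seed_tasks)))
    prompt = (
        "Here are example task instructions:\n"
        + "\n".join(f"- {t}" for t in demos)
        + "\nWrite one new, different task instruction:"
    )
    # `llm` is assumed to map a prompt string to a completion string.
    candidates = [llm(prompt).strip() for _ in range(num_new)]
    # Crude exact-match filtering; real pipelines use ROUGE-style thresholds.
    return [c for c in candidates if c not in seed_tasks]
```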
Maintenance & Community
- The project is community-driven, with a call for contributions to expand the dataset list.
- Download links are provided for each dataset entry.
Licensing & Compatibility
- Dataset licenses vary by source; users must verify individual dataset licenses for compatibility with their intended use, especially for commercial applications.
Limitations & Caveats
- The repository itself is a catalog; it does not host the datasets directly. Users must follow individual links to download.
- Dataset quality and suitability for specific LLM training tasks depend on the original source and generation methodology.