alpaca-chinese-dataset by carbonz0

Chinese instruction fine-tuning dataset

Created 2 years ago
395 stars

Top 73.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a Chinese instruction-following dataset designed for fine-tuning large language models, inspired by the Stanford Alpaca dataset. It aims to facilitate the development of Chinese-capable instruction-following models for researchers and developers working with Chinese NLP.

How It Works

The dataset consists of instruction-output pairs, with optional input fields, formatted in JSON. The generation methodology is based on the self-instruct approach, leveraging machine translation and data cleaning techniques to create diverse and high-quality instruction-following examples in Chinese.
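The schema described above can be illustrated with a small sketch. The records below are hypothetical examples (not taken from the dataset itself), showing the Alpaca-style layout: each entry is a JSON object with an "instruction", an optional (possibly empty) "input", and an "output", stored together in a JSON array.

```python
import json

# Hypothetical records illustrating the Alpaca-style schema:
# each entry has "instruction", an optional "input", and "output".
records = [
    {
        "instruction": "Translate the following sentence into Chinese.",
        "input": "The weather is nice today.",
        "output": "今天天气很好。",
    },
    {
        "instruction": "List three primary colors.",
        "input": "",  # the input field may be left empty
        "output": "红色、黄色、蓝色",
    },
]

# Round-trip through JSON, since the dataset file itself is a JSON array.
# ensure_ascii=False keeps Chinese characters readable in the output.
serialized = json.dumps(records, ensure_ascii=False, indent=2)
loaded = json.loads(serialized)

# Every entry carries exactly the three expected fields.
for entry in loaded:
    assert set(entry) == {"instruction", "input", "output"}
```

Because the format matches the original Alpaca data, existing Alpaca fine-tuning scripts can typically consume this file without modification.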

Highlighted Details

  • Dataset format is identical to the original Alpaca data (JSON).
  • Includes examples for various tasks like information retrieval, list generation, and hidden message extraction.
  • Generation methods include machine translation and self-instruct.

Maintenance & Community

Information regarding contributors, community channels, or roadmap is not available in the provided README.

Licensing & Compatibility

The license is not specified in the README. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The README indicates that the generation method for each data entry and the specific keyword/rule cleaning processes are pending further documentation.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
2 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (coauthor of SWE-bench, SWE-agent), and 28 more.

stanford_alpaca by tatsu-lab

Top 0.1% · 30k stars
Instruction-following LLaMA model training and data generation
Created 2 years ago · Updated 1 year ago