alpaca-chinese-dataset  by carbonz0

Chinese instruction fine-tuning dataset

created 2 years ago
392 stars

Top 74.5% on sourcepulse

Project Summary

This repository provides a Chinese instruction-following dataset designed for fine-tuning large language models, inspired by the Stanford Alpaca dataset. It aims to facilitate the development of Chinese-capable instruction-following models for researchers and developers working with Chinese NLP.

How It Works

The dataset consists of instruction-output pairs, each with an optional input field, stored as JSON. Entries are produced by machine translation and the self-instruct approach, followed by data cleaning, with the aim of diverse, high-quality instruction-following examples in Chinese.
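
As a rough illustration of that record layout, the sketch below parses a small JSON array and checks each entry. The field names `instruction`, `input`, and `output` follow the original Alpaca data; the sample records, `SAMPLE_JSON`, and the `load_records` helper are hypothetical and are not taken from the repository.

```python
import json

# Hypothetical sample in the Alpaca layout; not taken from the repository.
SAMPLE_JSON = """
[
  {"instruction": "把下面的句子翻译成英文。",
   "input": "今天天气很好。",
   "output": "The weather is nice today."},
  {"instruction": "列出三种常见的水果。",
   "input": "",
   "output": "苹果、香蕉、橙子。"}
]
"""

# "input" is optional, so only these two keys are required.
REQUIRED_KEYS = ("instruction", "output")


def load_records(text):
    """Parse a JSON array of records and check the expected keys."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            if not rec.get(key):
                raise ValueError(f"record {i} is missing '{key}'")
        # Treat a missing input the same as an empty one.
        rec.setdefault("input", "")
    return records


records = load_records(SAMPLE_JSON)
with_input = [r for r in records if r["input"]]
```

To use this on the real data, read the dataset file with `open(path, encoding="utf-8")` and pass its contents to `load_records`; the file name is not stated here, so it is left to the reader.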

Highlighted Details

  • Dataset format is identical to the original Alpaca data (JSON).
  • Includes examples for various tasks like information retrieval, list generation, and hidden message extraction.
  • Generation methods include machine translation and self-instruct.
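
Because the format matches the original Alpaca data, a record can be turned into a training prompt in the usual way. The sketch below is a minimal example under assumptions: the section markers are modeled on the Stanford Alpaca training template, the README does not state which template (if any) this dataset expects, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(record):
    """Render one Alpaca-style record as a prompt string.

    The "### ..." markers are an assumption borrowed from the Stanford
    Alpaca template; this dataset's README does not prescribe them.
    """
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    if context:
        # Records with an input get an extra context section.
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


# Hypothetical record, not taken from the dataset.
example = {
    "instruction": "把下面的句子翻译成英文。",
    "input": "今天天气很好。",
    "output": "The weather is nice today.",
}
prompt = build_prompt(example)
target = example["output"]
```

During fine-tuning, `prompt` would typically be the model input and `target` the expected completion.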

Maintenance & Community

Information regarding contributors, community channels, or roadmap is not available in the provided README.

Licensing & Compatibility

The license is not specified in the README. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The README indicates that the generation method for each data entry and the specific keyword/rule cleaning processes are pending further documentation.

Health Check
Last commit: 2 years ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History
1 star in the last 90 days
