Chinese instruction fine-tuning dataset
This repository provides a Chinese instruction-following dataset designed for fine-tuning large language models, inspired by the Stanford Alpaca dataset. It aims to facilitate the development of Chinese-capable instruction-following models for researchers and developers working with Chinese NLP.
How It Works
The dataset consists of instruction-output pairs, each with an optional input field, stored as JSON. The data was generated following the self-instruct approach, combined with machine translation and data cleaning, to produce diverse, high-quality Chinese instruction-following examples.
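Below is a minimal sketch of how such a file might be loaded and turned into prompts. The file name (zh_instruction_data.json) and the field names (instruction, input, output) are assumptions based on the Alpaca-style format described above, not confirmed details of this repository.

```python
import json

# Sketch: load an Alpaca-style Chinese instruction dataset.
# File name and field names are assumed; check the repository
# for the actual layout.
with open("zh_instruction_data.json", encoding="utf-8") as f:
    examples = json.load(f)

# Build prompts from the first few entries.
for example in examples[:3]:
    prompt = example["instruction"]
    if example.get("input"):  # the input field is optional
        prompt += "\n" + example["input"]
    print(prompt, "->", example["output"])
```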
Highlighted Details
Maintenance & Community
Information regarding contributors, community channels, or roadmap is not available in the provided README.
Licensing & Compatibility
The license is not specified in the README, so suitability for commercial use or closed-source linking is undetermined.
Limitations & Caveats
The README notes that the generation method for each data entry and the specific keyword- and rule-based cleaning steps are still pending documentation.