alpaca-chinese-dataset by carbonz0

Chinese instruction fine-tuning dataset

Created 2 years ago
395 stars

Top 73.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a Chinese instruction-following dataset designed for fine-tuning large language models, inspired by the Stanford Alpaca dataset. It aims to facilitate the development of Chinese-capable instruction-following models for researchers and developers working with Chinese NLP.

How It Works

The dataset consists of instruction-output pairs, with optional input fields, formatted in JSON. The generation methodology is based on the self-instruct approach, leveraging machine translation and data cleaning techniques to create diverse and high-quality instruction-following examples in Chinese.
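The schema described above can be illustrated with a small sketch. The records below are hypothetical examples (not taken from the dataset itself), showing the Alpaca-style layout: each entry is a JSON object with an "instruction", an optional (possibly empty) "input", and an "output", stored together in a JSON array.

```python
import json

# Hypothetical records illustrating the Alpaca-style schema:
# each entry has "instruction", an optional "input", and "output".
records = [
    {
        "instruction": "Translate the following sentence into Chinese.",
        "input": "The weather is nice today.",
        "output": "今天天气很好。",
    },
    {
        "instruction": "List three primary colors.",
        "input": "",  # the input field may be left empty
        "output": "红色、黄色、蓝色",
    },
]

# Round-trip through JSON, since the dataset file itself is a JSON array.
# ensure_ascii=False keeps Chinese characters readable in the output.
serialized = json.dumps(records, ensure_ascii=False, indent=2)
loaded = json.loads(serialized)

# Every entry carries exactly the three expected fields.
for entry in loaded:
    assert set(entry) == {"instruction", "input", "output"}
```

Because the format matches the original Alpaca data, existing Alpaca fine-tuning scripts can typically consume this file without modification.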

Highlighted Details

  • Dataset format is identical to the original Alpaca data (JSON).
  • Includes examples for various tasks like information retrieval, list generation, and hidden message extraction.
  • Generation methods include machine translation and self-instruct.

Maintenance & Community

Information regarding contributors, community channels, or roadmap is not available in the provided README.

Licensing & Compatibility

The license is not specified in the README. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The README indicates that the generation method for each data entry and the specific keyword/rule cleaning processes are pending further documentation.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
2 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (coauthor of SWE-bench, SWE-agent), and 28 more.

stanford_alpaca by tatsu-lab

Top 0.1% · 30k stars
Instruction-following LLaMA model training and data generation
Created 2 years ago · Updated 1 year ago