chat-dataset-baseline by hikariming

Fine-tuned chat model and dataset for Chinese dialogue

Created 2 years ago

1,197 stars

Top 32.6% on SourcePulse

1 Expert Loves This Project

Yaowei Zheng

Author of LLaMA-Factory

Project Summary

This repository provides a curated Chinese dialogue dataset and fine-tuning scripts for conversational AI models, targeting developers and researchers aiming to build high-quality Chinese language models. It offers a streamlined process for training foundational models that can be further customized for specific applications.

How It Works

The project uses the LLaMA-Factory framework for model training and focuses on integrating and refining Chinese datasets published on Hugging Face. The core process is data preprocessing followed by Supervised Fine-Tuning (SFT), run either with LoRA adapters or as full-parameter training, both of which LLaMA-Factory supports. The goal is a solid starting point for Chinese conversational AI development.
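The preprocessing step can be pictured with a minimal sketch. It assumes the raw data is a list of question/answer pairs and converts them into Alpaca-style records (`instruction`, `input`, `output`), one of the dataset formats LLaMA-Factory accepts. The input field names `question` and `answer` and the function name `to_alpaca_records` are hypothetical, chosen for illustration; this is not the repository's actual preprocess.py.

```python
def to_alpaca_records(raw_pairs):
    """Convert raw question/answer pairs into Alpaca-style SFT records.

    Assumption: each raw item is a dict with hypothetical keys
    "question" and "answer". Empty and duplicate pairs are dropped.
    """
    seen = set()
    records = []
    for pair in raw_pairs:
        question = (pair.get("question") or "").strip()
        answer = (pair.get("answer") or "").strip()
        # Skip pairs that are missing either side of the dialogue turn.
        if not question or not answer:
            continue
        key = (question, answer)
        # Skip exact duplicates (after whitespace trimming).
        if key in seen:
            continue
        seen.add(key)
        records.append({"instruction": question, "input": "", "output": answer})
    return records
```

Writing the result with `json.dump(records, f, ensure_ascii=False)` keeps the Chinese text readable in the output file instead of escaping it to `\uXXXX` sequences.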

Quick Start & Requirements

Installation: Clone the repository and install LLaMA-Factory as per its official documentation.
Dataset: Download the dataset from the provided Baidu Netdisk link.
Configuration: Modify preprocess.py to set model name and author.
Training: Replace LLaMA-Factory's data folder with this project's dataset, place train.py or train.sh in the LLaMA-Factory directory, and run the training scripts.
Prerequisites: LLaMA-Factory, Python, and potentially GPU resources for efficient training.
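The file-staging part of the training step could be scripted roughly as follows. This is a sketch under stated assumptions: the dataset is taken to sit in a folder named `data` inside the project checkout (a hypothetical name, since the dataset is downloaded separately), and the existing LLaMA-Factory `data` folder is renamed to `data.bak` rather than deleted. The function name `stage_project` is also illustrative.

```python
import shutil
from pathlib import Path


def stage_project(project_dir, factory_dir, dataset_dirname="data"):
    """Copy the project's dataset and training scripts into LLaMA-Factory.

    Assumptions: `dataset_dirname` is a hypothetical folder name for the
    downloaded dataset; the previous data folder is kept as `data.bak`.
    Returns the list of training scripts that were copied.
    """
    project = Path(project_dir)
    factory = Path(factory_dir)

    source_data = project / dataset_dirname
    if not source_data.is_dir():
        raise FileNotFoundError(f"dataset folder not found: {source_data}")

    target_data = factory / "data"
    backup = factory / "data.bak"
    if target_data.exists():
        if backup.exists():
            raise FileExistsError(f"backup already exists: {backup}")
        # Keep the original data folder so the swap can be undone.
        target_data.rename(backup)
    shutil.copytree(source_data, target_data)

    copied = []
    for script in ("train.py", "train.sh"):
        src = project / script
        if src.is_file():
            shutil.copy2(src, factory / script)
            copied.append(script)
    return copied
```

Renaming instead of deleting is a deliberate choice here: restoring the stock LLaMA-Factory datasets then only requires swapping the two folders back.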

Highlighted Details

Provides a curated, high-quality Chinese dialogue dataset.
Offers fine-tuning scripts for both LoRA and full SFT.
Based on the established LLaMA-Factory framework.
Iterative development following a PDCA cycle (Plan, Do, Check, Act).

Maintenance & Community

The README describes the project as under active development. Community engagement is encouraged via GitHub stars.

Licensing & Compatibility

The provided README does not state a license. It includes a citation entry listing the authors and the GitHub repository, which may suggest open-source intent, but a citation does not grant any license, so the terms of use need to be verified with the authors.

Limitations & Caveats

The dataset is hosted on Baidu Netdisk, which may have regional access limitations. The project is described as "evolving," suggesting potential for ongoing changes and instability. The specific license for commercial use or closed-source linking is not detailed.

Health Check

Last Commit

8 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

1 star in the last 30 days