chat-dataset-baseline  by hikariming

Fine-tuned chat model and dataset for Chinese dialogue

created 2 years ago
1,184 stars

Top 33.6% on sourcepulse

Project Summary

This repository provides a curated Chinese dialogue dataset and fine-tuning scripts for conversational AI models. It is aimed at developers and researchers building Chinese language models, and offers a streamlined process for training a baseline model that can then be customized for specific applications.

How It Works

The project uses the LLaMA-Factory framework for model training and focuses on integrating and refining high-quality Chinese datasets from Hugging Face. The core process is data preprocessing followed by Supervised Fine-Tuning (SFT), using either LoRA or full-parameter training, both of which LLaMA-Factory supports. The aim is a robust starting point for Chinese conversational AI development.
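The choice between the two fine-tuning modes can be sketched as a small argument builder. The keys used here (`stage`, `finetuning_type`, `lora_target` and so on) follow LLaMA-Factory's training arguments as documented in recent releases. The function name `build_sft_args`, the model path, dataset name, template, and all hyperparameter values are illustrative assumptions, not values taken from this repository.

```python
import json


def build_sft_args(mode: str, model_path: str, dataset: str, output_dir: str) -> dict:
    """Build a LLaMA-Factory-style argument dict for LoRA or full-parameter SFT.

    `mode` is "lora" or "full"; every concrete value below is a placeholder.
    """
    if mode not in ("lora", "full"):
        raise ValueError(f"mode must be 'lora' or 'full', got {mode!r}")

    args = {
        "stage": "sft",                    # supervised fine-tuning
        "do_train": True,
        "model_name_or_path": model_path,  # placeholder base model
        "dataset": dataset,                # name registered in data/dataset_info.json
        "template": "qwen",                # assumption: depends on the base model
        "finetuning_type": mode,           # "lora" or "full"
        "output_dir": output_dir,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 8,
        # Full-parameter runs typically use a smaller learning rate than LoRA runs.
        "learning_rate": 1e-4 if mode == "lora" else 1e-5,
        "num_train_epochs": 3.0,
    }
    if mode == "lora":
        # Attach adapters to all linear layers; only meaningful for LoRA.
        args["lora_target"] = "all"
    return args


if __name__ == "__main__":
    example = build_sft_args("lora", "path/to/base-model", "my_chat_dataset", "saves/lora-sft")
    print(json.dumps(example, indent=2))
```

Recent LLaMA-Factory releases accept such a configuration as a YAML or JSON file passed to `llamafactory-cli train`. Whether this project's train.py or train.sh uses the same keys should be checked against the scripts themselves.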

Quick Start & Requirements

  • Installation: Clone the repository and install LLaMA-Factory as per its official documentation.
  • Dataset: Download the dataset from the provided Baidu Netdisk link.
  • Configuration: Modify preprocess.py to set model name and author.
  • Training: Replace LLaMA-Factory's data folder with this project's dataset, place train.py or train.sh in the LLaMA-Factory directory, and run the training scripts.
  • Prerequisites: LLaMA-Factory, Python, and potentially GPU resources for efficient training.
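The dataset and training steps above amount to a few file operations, sketched below. This is a minimal sketch under stated assumptions: the helper name `install_dataset`, the `data.bak` backup folder, and all paths are hypothetical, and the backup step is an added precaution that the README does not describe.

```python
import shutil
from pathlib import Path


def install_dataset(dataset_dir: Path, llama_factory_dir: Path, train_script: Path) -> Path:
    """Swap LLaMA-Factory's data folder for the project dataset and copy a training script.

    Returns the path used to back up the original data folder (hypothetical `data.bak`).
    """
    data_dir = llama_factory_dir / "data"
    backup_dir = llama_factory_dir / "data.bak"

    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"dataset folder not found: {dataset_dir}")
    if backup_dir.exists():
        raise FileExistsError(f"backup already exists, refusing to overwrite: {backup_dir}")

    # Keep the original data folder so the swap can be undone.
    if data_dir.exists():
        data_dir.rename(backup_dir)

    # Put the downloaded dataset where LLaMA-Factory expects its data.
    shutil.copytree(dataset_dir, data_dir)

    # Place train.py or train.sh in the LLaMA-Factory directory.
    shutil.copy2(train_script, llama_factory_dir / train_script.name)
    return backup_dir
```

A call might look like `install_dataset(Path("dataset"), Path("LLaMA-Factory"), Path("train.sh"))`, after which the training script is run from inside the LLaMA-Factory directory.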

Highlighted Details

  • Provides a curated, high-quality Chinese dialogue dataset.
  • Offers fine-tuning scripts for both LoRA and full SFT.
  • Based on the established LLaMA-Factory framework.
  • Iterative development following a PDCA cycle (Plan, Do, Check, Act).

Maintenance & Community

The README describes the project as under active development and asks users to support it by starring the repository on GitHub.

Licensing & Compatibility

The README does not state a license. It includes a citation entry listing the authors and the GitHub repository, which suggests the authors intend open reuse, but a citation is not a license grant and the actual terms require verification.

Limitations & Caveats

The dataset is hosted on Baidu Netdisk, which may be difficult to access from some regions. The project is described as "evolving," so the dataset and scripts may change without notice. Because no license is stated, the terms for commercial use or closed-source linking are undefined.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 10 stars in the last 90 days
