Fine-tuned chat model and dataset for Chinese dialogue
Top 33.6% on sourcepulse
This repository provides a curated Chinese dialogue dataset and fine-tuning scripts for conversational AI models, aimed at developers and researchers building high-quality Chinese language models. It offers a streamlined process for training foundational models that can be further customized for specific applications.
How It Works
The project leverages the LLaMA-Factory framework for model training. It focuses on integrating and refining top-tier Chinese datasets from Hugging Face. The core process involves data preprocessing, followed by fine-tuning using either LoRA or full parameter Supervised Fine-Tuning (SFT), as supported by LLaMA-Factory. This approach aims to provide a robust starting point for Chinese conversational AI development.
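The README does not include the preprocessing code itself. As a rough illustration only, the sketch below converts hypothetical question/answer records into the Alpaca-style JSON layout that LLaMA-Factory's SFT stage accepts, and registers the file in data/dataset_info.json so the trainer can find it. The record schema, file names, and dataset name here are assumptions for illustration, not details confirmed by this project.

```python
import json
import os

# Hypothetical raw records; the project's actual schema is not shown in the README.
raw_dialogues = [
    {"question": "你好，请介绍一下你自己。",
     "answer": "你好！我是一个中文对话助手，很高兴为你服务。"},
]

def to_alpaca(records):
    """Convert simple Q/A pairs into the Alpaca-style layout
    accepted by LLaMA-Factory's SFT stage."""
    return [
        {
            "instruction": r["question"],  # the user turn
            "input": "",                   # no auxiliary context
            "output": r["answer"],         # the assistant turn
        }
        for r in records
    ]

os.makedirs("data", exist_ok=True)

# Write the converted dataset into LLaMA-Factory's data folder.
with open("data/chinese_dialogue.json", "w", encoding="utf-8") as f:
    json.dump(to_alpaca(raw_dialogues), f, ensure_ascii=False, indent=2)

# Register the file in data/dataset_info.json so LLaMA-Factory can load it
# under the dataset name "chinese_dialogue".
with open("data/dataset_info.json", "w", encoding="utf-8") as f:
    json.dump({"chinese_dialogue": {"file_name": "chinese_dialogue.json"}},
              f, ensure_ascii=False, indent=2)
```

The choice between LoRA and full-parameter SFT is then made in the LLaMA-Factory training configuration rather than in the data step.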
Quick Start & Requirements
Edit preprocess.py to set the model name and author, replace the data folder with this project's dataset, place train.py or train.sh in the LLaMA-Factory directory, and run the training scripts; a sketch of this workflow follows.
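Taken together, these steps amount to copying the project's dataset and scripts into a LLaMA-Factory checkout and launching training. A minimal Python sketch of that workflow is below; the directory names are hypothetical placeholders, not paths documented by the project.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical locations; adjust to your checkout layout.
PROJECT_DIR = Path("chinese-dialogue-project")
LLAMA_FACTORY_DIR = Path("LLaMA-Factory")

# 1. Set the model name and author (the README says preprocess.py handles this).
subprocess.run(["python", str(PROJECT_DIR / "preprocess.py")], check=True)

# 2. Replace LLaMA-Factory's data folder with this project's dataset.
shutil.rmtree(LLAMA_FACTORY_DIR / "data", ignore_errors=True)
shutil.copytree(PROJECT_DIR / "data", LLAMA_FACTORY_DIR / "data")

# 3. Place the training scripts in the LLaMA-Factory directory.
for script in ("train.py", "train.sh"):
    shutil.copy(PROJECT_DIR / script, LLAMA_FACTORY_DIR / script)

# 4. Run training from inside LLaMA-Factory.
subprocess.run(["bash", "train.sh"], cwd=LLAMA_FACTORY_DIR, check=True)
```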
Maintenance & Community
The project is actively under development. Community engagement is encouraged via GitHub stars.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. A citation block names the authors and the GitHub repository, but no license file or terms accompany it, so the licensing status should be verified before reuse.
Limitations & Caveats
The dataset is hosted on Baidu Netdisk, which may have regional access limitations. The project is described as "evolving," suggesting potential for ongoing changes and instability. The specific license for commercial use or closed-source linking is not detailed.