huanhuan-chat by KMnO4-zx

Chatbot for emulating a character's speech from a TV drama via LoRA fine-tuning

Created 2 years ago

791 stars

Top 44.3% on SourcePulse

Project Summary

This project provides a complete workflow for fine-tuning large language models (LLMs) on custom script data to create character-specific chatbots. It is aimed at users who want personalized AI experiences, particularly fans of Chinese historical dramas, and lets them replicate the speech patterns of characters such as Zhen Huan from the popular TV series "Empresses in the Palace."

How It Works

The project applies LoRA (Low-Rank Adaptation) fine-tuning to a base LLM (Llama 3.1 8B Instruct in the demonstrated setup) using dialogue data extracted from scripts. Its main contribution is the data processing pipeline, which converts raw script text into a structured conversational format suitable for fine-tuning: it extracts character-dialogue pairs, formats them into instruction-response datasets, and suggests ways to augment the data.
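The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: it assumes a script format of one `Speaker: line` per row (ASCII or full-width colon), treats anything else as a stage direction, and emits Alpaca-style `instruction`/`input`/`output` records. The function names `parse_script` and `build_pairs` and the sample dialogue are hypothetical.

```python
import json
import re

# Assumed line format: "Speaker: utterance" with an ASCII or full-width colon.
LINE_RE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")


def parse_script(text):
    """Return (speaker, utterance) tuples; lines without a colon are skipped."""
    turns = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


def build_pairs(turns, character):
    """Pair each line spoken by `character` with the preceding line of another speaker."""
    pairs = []
    for (prev_speaker, prev_line), (speaker, line) in zip(turns, turns[1:]):
        if speaker == character and prev_speaker != character:
            pairs.append({"instruction": prev_line, "input": "", "output": line})
    return pairs


# Made-up sample dialogue for illustration only, not taken from the real script.
sample = """\
Emperor: Why are you out so late?
Zhen Huan: I came to admire the plum blossoms.
(The wind rises.)
Maid: My lady, it is getting cold.
Zhen Huan: Then let us go back.
"""

pairs = build_pairs(parse_script(sample), "Zhen Huan")
print(json.dumps(pairs, ensure_ascii=False, indent=2))
```

Real scripts usually need more cleanup than this (scene headings, multi-line speeches, narration), which is why extraction quality depends heavily on the input format.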

Quick Start & Requirements

Install: pip install modelscope transformers accelerate peft datasets
Prerequisites: Ubuntu 22.04, Python 3.12, PyTorch 2.3.0 with CUDA 12.1.
Setup: Requires downloading a base LLM (e.g., Llama 3.1 8B Instruct) and preparing custom script data. Training time is approximately 20-30 minutes.
Links:
- Modelscope: Link
- Project: Link
- Xlab: Link
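Once the packages above are installed and a base model is loaded, attaching LoRA adapters with `peft` might look like the sketch below. The hyperparameter values and the list of target modules are assumptions chosen for illustration, not settings confirmed by the project; `build_lora_config` and `attach_lora` are hypothetical helper names.

```python
# Assumed hyperparameters for illustration; the project may use different values.
LORA_SETTINGS = {
    "r": 8,                # rank of the low-rank update matrices
    "lora_alpha": 32,      # scaling factor applied to the update
    "lora_dropout": 0.1,   # dropout on the LoRA branch during training
    "target_modules": [    # projection layers commonly adapted in Llama-style models
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def build_lora_config():
    """Create a peft LoraConfig for causal language modeling."""
    # Imported lazily so the settings can be inspected without peft installed.
    from peft import LoraConfig, TaskType

    return LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, **LORA_SETTINGS)


def attach_lora(model):
    """Wrap an already loaded causal LM with LoRA adapters and report trainable parameters."""
    from peft import get_peft_model

    peft_model = get_peft_model(model, build_lora_config())
    peft_model.print_trainable_parameters()
    return peft_model
```

Here `model` would be the base LLM loaded beforehand, for example via `transformers.AutoModelForCausalLM.from_pretrained` pointed at the downloaded weights.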

Highlighted Details

- Fine-tunes LLMs to mimic specific character speech patterns from scripts.
- Provides a full pipeline from data preparation to model training and testing.
- Demonstrates using LoRA for efficient fine-tuning.
- Includes example data processing for "Empresses in the Palace" and "Journey to the West."
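To give a sense of why LoRA counts as efficient, the short calculation below estimates how many parameters it trains. It is a back-of-the-envelope sketch under stated assumptions: rank 8, adapters on all seven attention and MLP projections, and layer shapes taken from the publicly documented Llama 3.1 8B configuration (hidden size 4096, 32 layers, key/value projection width 1024, MLP width 14336). The project's actual rank and target modules may differ.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds two matrices, A (r x d_in) and B (d_out x r), per adapted weight."""
    return r * (d_in + d_out)


# Assumed Llama 3.1 8B shapes (input width, output width) per projection.
HIDDEN, KV_WIDTH, MLP_WIDTH, LAYERS = 4096, 1024, 14336, 32
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_WIDTH),
    "up_proj": (HIDDEN, MLP_WIDTH),
    "down_proj": (MLP_WIDTH, HIDDEN),
}

RANK = 8  # assumed rank for illustration
per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
total = per_layer * LAYERS

print(f"trainable LoRA parameters per layer: {per_layer:,}")
print(f"trainable LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add roughly 21 million trainable parameters, a small fraction of a percent of the roughly 8 billion weights in the base model, which stay frozen.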

Maintenance & Community

The project has received recognition in AI competitions, including the 2023 iFlytek Spark Cup and the 2024 OpenBMB Challenge. Contributors are listed as members of Datawhale.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project focuses on Chinese script data and may require significant adaptation for other languages or data formats. The effectiveness of the data extraction and augmentation steps can vary depending on the quality and structure of the input script.

Health Check

Last Commit

9 months ago

Responsiveness

1 day

Star History

10 stars in the last 30 days