huanhuan-chat  by KMnO4-zx

Chatbot for emulating a character's speech from a TV drama via LoRA fine-tuning

created 2 years ago
709 stars

Top 49.3% on sourcepulse

Project Summary

This project provides an end-to-end workflow for fine-tuning large language models (LLMs) on custom script data to build character-specific chatbots. It is aimed at users who want a personalized chatbot, particularly fans of Chinese historical dramas, and lets them reproduce the speech patterns of characters such as Zhen Huan from the popular TV series "Empresses in the Palace."

How It Works

The project applies LoRA (Low-Rank Adaptation) fine-tuning to a base LLM (Llama 3.1 8B Instruct in the demonstration), using dialogue extracted from scripts. Its main contribution is the data processing pipeline, which converts raw script text into a structured conversational format suitable for fine-tuning: it extracts character-dialogue pairs, formats them as instruction-response datasets, and offers suggestions for data augmentation.
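The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline. The "Speaker: utterance" line format, the function names, the sample dialogue, and the Alpaca-style `instruction`/`input`/`output` keys are all assumptions made for this example.

```python
import json
import re

# Assumed line format: "Speaker: utterance" (ASCII or full-width colon).
# Real scripts will differ, so this pattern is only a starting point.
LINE_RE = re.compile(r"^\s*([^:：]+)[:：]\s*(.+?)\s*$")


def parse_script(text):
    """Return (speaker, utterance) tuples, skipping lines that do not match."""
    turns = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            turns.append((match.group(1).strip(), match.group(2)))
    return turns


def build_pairs(turns, character):
    """Pair each line spoken by `character` with the preceding line from another speaker."""
    pairs = []
    for (prev_speaker, prev_text), (speaker, text) in zip(turns, turns[1:]):
        if speaker == character and prev_speaker != character:
            pairs.append({"instruction": prev_text, "input": "", "output": text})
    return pairs


# Invented sample dialogue, used only to exercise the functions.
SAMPLE = """\
Emperor: You seem troubled today.
Zhen Huan: It is nothing, Your Majesty.
(She lowers her eyes.)
Attendant: The tea is ready.
Zhen Huan: Leave it on the table.
Zhen Huan: And close the door.
"""

pairs = build_pairs(parse_script(SAMPLE), "Zhen Huan")
print(json.dumps(pairs, indent=2, ensure_ascii=False))
```

In this sketch the stage direction is dropped because it has no speaker prefix, and a second consecutive line by the same character is not paired with anything. Whether to merge such consecutive lines into one response is a design choice that depends on the script.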

Quick Start & Requirements

  • Install: pip install modelscope transformers accelerate peft datasets
  • Prerequisites: Ubuntu 22.04, Python 3.12, PyTorch 2.3.0 with CUDA 12.1.
  • Setup: Requires downloading a base LLM (e.g., Llama 3.1 8B Instruct) and preparing custom script data. Training time is approximately 20-30 minutes.
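Before training, it can help to confirm that the packages named above are importable. The following is a small convenience sketch using only the standard library. The `check_environment` helper is hypothetical and not part of the project, and `torch` is included on the assumption that PyTorch is installed separately from the `pip install` line.

```python
import importlib.util
import sys

# Package names from the install line above, plus torch (assumed to be
# installed separately, since the prerequisites mention PyTorch).
REQUIRED = ["modelscope", "transformers", "accelerate", "peft", "datasets", "torch"]


def check_environment(packages=REQUIRED):
    """Map each package name to True if it can be located on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


report = check_environment()
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, found in report.items():
    print(f"{name:<14}{'ok' if found else 'MISSING'}")
```

The check only locates packages without importing them, so it runs quickly and does not initialize CUDA. It does not verify versions.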

Highlighted Details

  • Fine-tunes LLMs to mimic specific character speech patterns from scripts.
  • Provides a full pipeline from data preparation to model training and testing.
  • Demonstrates using LoRA for efficient fine-tuning.
  • Includes example data processing for "Empresses in the Palace" and "Journey to the West."
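To illustrate why LoRA is considered efficient: instead of updating a full weight matrix of shape d_out x d_in, LoRA freezes it and trains two low-rank factors, B (d_out x r) and A (r x d_in), so the number of trainable parameters per adapted matrix is r * (d_in + d_out). The sketch below computes that count. The 4096 x 4096 matrix size and the ranks are illustrative assumptions, not settings taken from the project.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen full-rank weight matrix."""
    return d_in * d_out


# Illustrative square projection matrix; real layers vary in shape.
d_in = d_out = 4096
for rank in (4, 8, 16):
    lora = lora_trainable_params(d_in, d_out, rank)
    full = full_params(d_in, d_out)
    print(f"rank={rank:>2}: {lora:>7,} trainable vs {full:,} frozen ({lora / full:.2%})")
```

At rank 8, this gives 65,536 trainable parameters against 16,777,216 frozen ones, about 0.39% of the matrix.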

Maintenance & Community

The project has received recognition in AI competitions, including the 2023 iFlytek Spark Cup and the 2024 OpenBMB Challenge. Contributors are listed as members of Datawhale.

Licensing & Compatibility

The README does not explicitly state a license, so the terms for commercial use or closed-source linking are unspecified.

Limitations & Caveats

The project focuses on Chinese script data and may require significant adaptation for other languages or data formats. The effectiveness of the data extraction and augmentation steps can vary depending on the quality and structure of the input script.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 48 stars in the last 90 days
