chat4u  by li-plus

Create a personalized chatbot from WeChat chat logs

Created 3 years ago
252 stars

Top 99.6% on SourcePulse

GitHubView on GitHub
Project Summary

Chat4U enables users to train a personalized chatbot using their WeChat chat history. It targets individuals seeking to create a conversational AI that mimics their personal communication style, offering a unique way to leverage personal data for AI training. The primary benefit is a highly customized chatbot experience derived from one's own interactions.

How It Works

The project employs a multi-stage pipeline. First, WeChat chat data is extracted and decrypted. On macOS, this involves using wechat-decipher-macos to obtain database keys, followed by sqlcipher to decrypt msg_*.db files. For other OSes, alternative (and unverified) methods are suggested. Decrypted SQLite databases are then processed by prepare_data.py to generate a train.json dataset, currently supporting single-turn dialogues. This data is used to fine-tune a LLaMA-7B model using Stanford Alpaca's full fine-tuning approach with DeepSpeed zero3 on a GPU-enabled Linux machine.

Quick Start & Requirements

Highlighted Details

  • Leverages existing large language models (LLaMA-7B) for fine-tuning, reducing the need for training from scratch.
  • Detailed instructions provided for macOS data decryption, a common pain point for WeChat data access.
  • Supports integration with WeChat via wechat-chatgpt and an OpenAI-compatible API server for seamless usage.
  • Offers alternative data extraction methods for Android, iPhone, and Windows, though these are noted as unverified.

Maintenance & Community

This repository appears to be a personal project with no explicit mention of maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

The README does not explicitly state a software license for the li-plus/chat4u project itself. However, it relies on other repositories like stanford_alpaca and alpaca-lora, which have their own licenses (typically Apache 2.0 or similar permissive licenses), and uses models like LLaMA, which may have specific usage restrictions. Compatibility for commercial use would require careful review of all underlying components and models.

Limitations & Caveats

The data extraction process is heavily reliant on macOS and specific WeChat versions, posing a significant barrier for users on other operating systems. The data preparation script currently only handles single-turn dialogues, and the resulting chatbot may exhibit common sense errors despite mimicking chat style. Training requires substantial GPU resources.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
7 stars in the last 30 days

Explore Similar Projects

Feedback? Help us improve.