chat4u by li-plus

Create a personalized chatbot from WeChat chat logs

Created 3 years ago

258 stars

Top 97.9% on SourcePulse

Project Summary

Chat4U enables users to train a personalized chatbot using their WeChat chat history. It targets individuals seeking to create a conversational AI that mimics their personal communication style, offering a unique way to leverage personal data for AI training. The primary benefit is a highly customized chatbot experience derived from one's own interactions.

How It Works

The project employs a multi-stage pipeline. First, WeChat chat data is extracted and decrypted. On macOS, this involves using wechat-decipher-macos to obtain database keys, followed by sqlcipher to decrypt msg_*.db files. For other OSes, alternative (and unverified) methods are suggested. Decrypted SQLite databases are then processed by prepare_data.py to generate a train.json dataset, currently supporting single-turn dialogues. This data is used to fine-tune a LLaMA-7B model using Stanford Alpaca's full fine-tuning approach with DeepSpeed zero3 on a GPU-enabled Linux machine.

Quick Start & Requirements

Data Extraction (macOS): Requires macOS, WeChat desktop client, nalzok/wechat-decipher-macos (dtrace script), sqlcipher (via brew install sqlcipher), and Python 3.
Training: Requires a Linux machine with NVIDIA GPUs (e.g., 8x V100-SXM2-32GB recommended for 90k samples in ~1 hour), DeepSpeed, and PyTorch.
Deployment: Requires a Docker-compatible environment for wechat-chatgpt and a Python environment for the OpenAI-compatible API server.
Links:
- wechat-decipher-macos: https://github.com/nalzok/wechat-decipher-macos
- stanford_alpaca: https://github.com/tatsu-lab/stanford_alpaca
- alpaca-lora: https://github.com/tloen/alpaca-lora
- wechat-chatgpt: https://github.com/holgots/wechat-chatgpt

Highlighted Details

Leverages existing large language models (LLaMA-7B) for fine-tuning, reducing the need for training from scratch.
Detailed instructions provided for macOS data decryption, a common pain point for WeChat data access.
Supports integration with WeChat via wechat-chatgpt and an OpenAI-compatible API server for seamless usage.
Offers alternative data extraction methods for Android, iPhone, and Windows, though these are noted as unverified.

Maintenance & Community

This repository appears to be a personal project with no explicit mention of maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

The README does not explicitly state a software license for the li-plus/chat4u project itself. However, it relies on other repositories like stanford_alpaca and alpaca-lora, which have their own licenses (typically Apache 2.0 or similar permissive licenses), and uses models like LLaMA, which may have specific usage restrictions. Compatibility for commercial use would require careful review of all underlying components and models.

Limitations & Caveats

The data extraction process is heavily reliant on macOS and specific WeChat versions, posing a significant barrier for users on other operating systems. The data preparation script currently only handles single-turn dialogues, and the resulting chatbot may exhibit common sense errors despite mimicking chat style. Training requires substantial GPU resources.

chat4u by li-plus

Explore Similar Projects

mamba-chat by midrender

OpenAssistantGPT by OpenAssistantGPT

langchain-chatbot by shashankdeshpande

langchain-ui by homanp

open-claude-tag by Anil-matcha

baize-chatbot by project-baize

Chatbot by ahmadfaizalbh

ChatGPT-on-WeChat by kx-Huang

wechatbot by djun

wechat-bot by wangrongding

WeChatFerry by lich0821

wechat-chatgpt by fuergaosi233