Fujisaki by ljsabc

Create a digital doppelgänger from your Twitter archive

Created 2 years ago

323 stars

Top 84.3% on SourcePulse

Project Summary

This project aims to create a personalized AI doppelgänger using a user's Twitter archive and LoRA fine-tuning. It's designed for individuals interested in digital immortality, content generation, and prompt engineering, offering a unique way to preserve and interact with a digital representation of oneself.

How It Works

The project processes a user's Twitter archive to generate an instruction-style JSON dataset. This dataset is then used to fine-tune a Chinese language model (currently ChatGLM) with LoRA. The approach prioritizes sampling for varied responses, aiming for higher interactivity than simple retrieval or Q&A systems. Advanced features include parsing replies for context and using OpenAI to augment training data with questions or preambles.
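As a rough illustration of the first step, the sketch below turns a list of tweet texts into instruction-style JSON records. The field names (`instruction`, `input`, `output`), the placeholder prompt, and the helper names are assumptions made for this example. They are not taken from the project's code, and the real parser may read the archive and shape its records differently.

```python
import json

# Hypothetical placeholder prompt; the project's actual prompt wording is not given here.
DEFAULT_INSTRUCTION = "Write a tweet in your own voice."


def build_records(tweets, instruction=DEFAULT_INSTRUCTION):
    """Turn raw tweet strings into instruction-style records (assumed schema)."""
    records = []
    for text in tweets:
        # Collapse newlines and repeated whitespace into single spaces.
        cleaned = " ".join(text.split())
        if not cleaned:
            # Skip tweets that are empty after cleaning.
            continue
        records.append({"instruction": instruction, "input": "", "output": cleaned})
    return records


def to_json(records):
    """Serialize records; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


sample_tweets = ["Coffee first,\n then code.", "   ", "今天天气不错"]
records = build_records(sample_tweets)
print(to_json(records))
```

Keeping `input` empty reflects the mostly declarative nature of tweets. An augmentation step, such as the OpenAI-generated questions mentioned above, could fill `instruction` or `input` with a matching question instead.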

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Recommended: conda install cudatoolkit=11.3 (or 11.8 for RTX 4090).
Requires a Twitter archive (HTML file) placed in the project root.
Optional: OpenAI API key for data augmentation.
Setup involves parsing the archive, tokenizing data, and then training.
Demo: HuggingFace Hub, Colab notebook.

Highlighted Details

Fine-tuning leverages LoRA for efficient adaptation of large language models.
Supports multi-GPU training for faster iteration.
Offers options for parsing replies and using OpenAI for enhanced context.
Training on a 75,000-tweet dataset takes roughly 3 hours per epoch on an A100.
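To make the efficiency claim concrete, the sketch below compares the trainable parameters of a full weight matrix with those of a LoRA adapter, which learns a low-rank update `B @ A` with `B` of shape `(d_out, r)` and `A` of shape `(r, d_in)`. The matrix size and rank are illustrative assumptions, not values from the project's configuration.

```python
def full_param_count(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters for the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only; not taken from the project or from ChatGLM's configuration.
d_out = d_in = 4096
rank = 8

full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, reduction: {full // lora}x")
```

With these assumed sizes, the adapter trains 65,536 parameters for a matrix that holds 16,777,216, a 256-fold reduction for that layer.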

Maintenance & Community

The project is an ongoing, fast-prototyping effort. Discussion is encouraged, but the README provides no links to community channels such as Discord or Slack.

Licensing & Compatibility

The README does not state a license, although it references projects released under various licenses (e.g., MIT, Apache 2.0). Commercial use or closed-source linking would require clarification.

Limitations & Caveats

The project is at an early, fast-prototyping stage. Parsing replies requires Selenium and a proxy pool, and deleted tweets cannot be retrieved. Dialogue quality is limited by the nature of tweet data, which consists mostly of declarative sentences.

Health Check

Last Commit

2 years ago

Responsiveness

1 day

Star History

0 stars in the last 30 days