Fujisaki by ljsabc

Create a digital doppelgänger from your Twitter archive

created 2 years ago
322 stars

Top 85.5% on sourcepulse

Project Summary

Fujisaki builds a personalized AI doppelgänger from a user's Twitter archive using LoRA fine-tuning. It targets people interested in digital immortality, content generation, and prompt engineering, and offers a way to preserve and interact with a digital representation of oneself.

How It Works

The project processes a user's Twitter archive to generate an instruction-style JSON dataset. This dataset is then used to fine-tune a Chinese language model (currently ChatGLM) with LoRA. The approach prioritizes sampling for varied responses, aiming for higher interactivity than simple retrieval or Q&A systems. Advanced features include parsing replies for context and using OpenAI to augment training data with questions or preambles.
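The dataset-building step described above can be sketched as follows. This is only an illustration, assuming the tweets have already been extracted from the archive as plain strings. The `instruction`/`input`/`output` field names follow the common Alpaca-style layout and are an assumption, as are the prompt text, the retweet and length filters, and the helper names; the project's own parser may work differently.

```python
import json

# Hypothetical fixed prompt; the project's actual prompt wording is not given here.
DEFAULT_INSTRUCTION = "Write a tweet in your own voice."


def tweets_to_records(tweets, instruction=DEFAULT_INSTRUCTION, min_chars=5):
    """Turn raw tweet texts into instruction-style training records."""
    records = []
    seen = set()
    for text in tweets:
        # Collapse runs of whitespace and newlines into single spaces.
        cleaned = " ".join(text.split())
        # Skip near-empty tweets and plain retweets (assumed filters).
        if len(cleaned) < min_chars or cleaned.startswith("RT @"):
            continue
        # Drop exact duplicates so repeated tweets are not over-weighted.
        if cleaned in seen:
            continue
        seen.add(cleaned)
        records.append({"instruction": instruction, "input": "", "output": cleaned})
    return records


def to_json(records):
    """Serialize records as a JSON array, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    sample = ["Hello   world\nagain", "RT @someone: hi there", "ok"]
    print(to_json(tweets_to_records(sample)))
```

Each surviving tweet becomes the `output` of one record, so the model learns to produce tweet-like text in response to the prompt. Augmenting records with generated questions or preambles, as the project optionally does via OpenAI, would replace the fixed `instruction` with per-tweet context.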

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Recommended: conda install cudatoolkit=11.3 (or 11.8 for RTX 4090).
  • Requires a Twitter archive (HTML file) placed in the project root.
  • Optional: OpenAI API key for data augmentation.
  • Setup involves parsing the archive, tokenizing data, and then training.
  • Demos: available on the Hugging Face Hub and as a Colab notebook.

Highlighted Details

  • Fine-tuning leverages LoRA for efficient adaptation of large language models.
  • Supports multi-GPU training for faster iteration.
  • Offers options for parsing replies and using OpenAI for enhanced context.
  • Training a 75,000-tweet dataset on an A100 takes ~3 hours per epoch.
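To illustrate why LoRA makes fine-tuning efficient, here is a minimal pure-Python sketch of the low-rank update it relies on: the frozen weight W is combined with a trainable product B·A scaled by alpha/r. The function names, matrix sizes, and the rank of 8 are illustrative assumptions, not values taken from the project.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w has shape (d_out, d_in), a has shape (r, d_in), b has shape (d_out, r).
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Count adapter parameters: A contributes r * d_in, B contributes d_out * r."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative sizes only: a 4096 x 4096 layer with a rank-8 adapter.
    full = 4096 * 4096
    adapter = trainable_params(4096, 4096, 8)
    print(f"adapter trains {adapter} of {full} weights ({100 * adapter / full:.2f}%)")
```

Because B is typically initialized to zero, the adapted layer starts out identical to the pretrained one, and only the small A and B matrices receive gradient updates during training.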

Maintenance & Community

The README describes the project as an ongoing, fast-prototyping effort. It encourages discussion but does not link to any Discord or Slack channel.

Licensing & Compatibility

The README does not state a license, although it references projects released under various licenses (e.g., MIT, Apache 2.0). Commercial use or closed-source linking would need clarification from the author.

Limitations & Caveats

The project is at an early, fast-prototyping stage. Parsing replies requires Selenium and a proxy pool, and deleted tweets cannot be retrieved. Dialogue quality is limited by the nature of tweet data, which consists mostly of declarative sentences.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 90 days
