meGPT by adrianco

LLM persona creation from personal content

created 1 year ago
274 stars

Top 95.2% on sourcepulse

Project Summary

This repository provides a framework for creating personalized LLM "personas" based on an individual's public content, enabling chatbots that can answer questions and generate summaries in the author's unique voice. It's designed for authors, speakers, and experts looking to leverage their body of work for AI-driven interactions, offering a low-friction approach to building a digital twin.

How It Works

An author's public content (books, blog posts, tweets, videos, etc.) is organized and processed by specialized scripts. A central build.py script orchestrates the ingestion and preprocessing of each content type and produces a structured dataset. That dataset then serves as the foundation for an LLM persona, whether through retrieval (RAG), training, or fine-tuning, so that the model can emulate the author's style and knowledge. The code is largely AI-generated, with explicit context comments included to streamline development.
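The orchestration described above can be pictured as a dispatch table that routes each content item to a type-specific processor. The following is only a minimal sketch under stated assumptions: the processor functions, the PROCESSORS table, and the (kind, source) item format are hypothetical and are not taken from the repository's actual build.py.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


# Hypothetical processors: each turns a source reference into plain text.
# Real processors would parse PDFs, blog archives, and so on.
def process_blog(source: str) -> str:
    return f"[blog text from {source}]"


def process_book(source: str) -> str:
    return f"[book text from {source}]"


# Hypothetical dispatch table mapping a content kind to its processor.
PROCESSORS: Dict[str, Callable[[str], str]] = {
    "blog": process_blog,
    "book": process_book,
}


def build(author: str, items: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Route each (kind, source) item to its processor and group output by kind."""
    dataset: Dict[str, List[str]] = defaultdict(list)
    for kind, source in items:
        processor = PROCESSORS.get(kind)
        if processor is None:
            # Unknown kinds are reported and skipped rather than aborting the build.
            print(f"{author}: skipping unsupported kind {kind!r} ({source})")
            continue
        dataset[kind].append(processor(source))
    return dict(dataset)


if __name__ == "__main__":
    demo = [("blog", "post-1.txt"), ("book", "book.pdf"), ("hologram", "x")]
    print(build("virtual_adrianco", demo))
```

Keeping the per-type logic behind one table means a new content type only needs a new processor and one table entry, which is one plausible way to keep a central build script small.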

Quick Start & Requirements

  • Install: Clone the repo, create a Python virtual environment (python3 -m venv venv), activate it (source venv/bin/activate on macOS/Linux, or venv\Scripts\activate on Windows), and install dependencies (pip install -r requirements.txt).
  • Requirements: Python 3.x.
  • Usage: Run python build.py <author> (e.g., python build.py virtual_adrianco).
  • More Info: meGPT GitHub

Highlighted Details

  • Supports ingestion of diverse content types: books (PDF), blog posts (text), Twitter/Mastodon archives, code, presentations (images), podcasts (audio), and videos.
  • Includes scripts for extracting content from Medium and Blogger archives, and processing Twitter conversations.
  • Aims to facilitate RAG-based LLM persona development, with links to relevant presentations and experiments.
  • YouTube download scripts attempt multiple methods to bypass bot detection, though manual downloads may sometimes be necessary.
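The RAG-based approach mentioned in the list above can be illustrated with a toy retrieval step: pick the passages most relevant to a question, then place them in the prompt. This is a sketch only, and it swaps in simple word overlap for the embedding search a real RAG pipeline would typically use. The function names and the prompt wording are assumptions, not code from the repository.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Return up to k passages sharing the most words with the question."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(p)), i, p) for i, p in enumerate(passages)]
    # Drop passages with no overlap; rank by score, ties keep original order.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [p for _, _, p in ranked[:k]]


def build_prompt(author: str, question: str, passages: List[str]) -> str:
    """Assemble a prompt asking the model to answer from retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return (
        f"Answer as {author}, using only the excerpts below.\n"
        f"Excerpts:\n{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    corpus = [
        "Microservices need good observability.",
        "Cooking pasta takes ten minutes.",
        "Observability of microservices in the cloud.",
    ]
    print(build_prompt("the author", "How do microservices get observability?", corpus))
```

Retrieval keeps answers tied to the author's published content without retraining the model, which is consistent with the licensing note about referencing published content.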

Maintenance & Community

The project was initiated and is maintained by Adrian Cockcroft (adrianco). Development is primarily driven by AI (ChatGPT 4, Cursor Claude Sonnet 3.7) with minimal manual code edits. Community contributions are encouraged via pull requests for content inclusion.

Licensing & Compatibility

Creative Commons - Attribution Share-Alike. Explicit permission is granted for use as training data to develop the meGPT concept. Free for use by any author/speaker/expert to create a chatbot that answers questions as if it were the author, referencing published content.

Limitations & Caveats

The project relies heavily on AI-generated code, which may require debugging or refinement by experienced Python developers. YouTube video downloading can be challenging due to sophisticated bot detection measures, potentially requiring manual intervention. Transcript quality from YouTube videos is noted as suboptimal.
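The "try several methods, then fall back to manual download" behavior described above follows a common fallback-chain pattern. The sketch below shows that pattern only. The two download functions are stand-in stubs and DownloadError is a hypothetical exception type; none of this reflects the repository's actual download scripts or any real downloader API.

```python
from typing import Callable, List, Optional


class DownloadError(Exception):
    """Hypothetical error raised when one download method fails."""


def method_primary(url: str) -> str:
    # Stub that simulates a method being blocked by bot detection.
    raise DownloadError("blocked")


def method_alternate(url: str) -> str:
    # Stub that simulates a successful download and returns a local path.
    return "downloads/video.mp4"


def try_in_order(url: str, methods: List[Callable[[str], str]]) -> Optional[str]:
    """Try each method in turn; return the first result, or None if all fail."""
    for method in methods:
        try:
            return method(url)
        except DownloadError as exc:
            print(f"{method.__name__} failed: {exc}")
    return None


if __name__ == "__main__":
    path = try_in_order("https://example.com/video", [method_primary, method_alternate])
    if path is None:
        print("All methods failed; download the file manually.")
    else:
        print(f"Saved to {path}")
```

Returning None instead of raising lets the caller decide how to prompt for a manual download.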

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 27 stars in the last 90 days
