meGPT by adrianco

LLM persona creation from personal content

Created 1 year ago

283 stars

Top 92.4% on SourcePulse

Project Summary

This repository provides a framework for creating personalized LLM "personas" based on an individual's public content, enabling chatbots that can answer questions and generate summaries in the author's unique voice. It's designed for authors, speakers, and experts looking to leverage their body of work for AI-driven interactions, offering a low-friction approach to building a digital twin.

How It Works

The project takes a data-driven approach: an author's public content (books, blog posts, tweets, videos, etc.) is organized and processed by specialized scripts. A central build.py script orchestrates the ingestion and preprocessing of the various content types, producing a structured dataset. This dataset then serves as the foundation for the persona, whether through retrieval-augmented generation (RAG) or by training or fine-tuning an LLM, so that it can emulate the author's style and knowledge. The code is largely AI-generated, and explicit context comments in the scripts help keep that workflow manageable.
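To make the orchestration idea concrete, here is a minimal sketch of a build step that reads a content index and dispatches each entry to a processor chosen by content kind. It is not the repository's actual build.py: the index columns (kind, title, url), the processor functions, and the sample rows are all hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Callable, Dict, List


# Hypothetical processors: each turns one index row into a text record.
def process_blog(row: dict) -> dict:
    return {"kind": "blog", "source": row["url"], "text": f"Blog post: {row['title']}"}


def process_book(row: dict) -> dict:
    return {"kind": "book", "source": row["url"], "text": f"Book: {row['title']}"}


# Map each content kind to the function that knows how to handle it.
PROCESSORS: Dict[str, Callable[[dict], dict]] = {
    "blog": process_blog,
    "book": process_book,
}


def build(index_csv: str) -> Dict[str, List[dict]]:
    """Read a content index and dispatch each row to the processor for its kind."""
    dataset = defaultdict(list)
    for row in csv.DictReader(io.StringIO(index_csv)):
        processor = PROCESSORS.get(row["kind"])
        if processor is None:
            # Unknown kinds are collected rather than silently dropped.
            dataset["skipped"].append(row)
            continue
        dataset[row["kind"]].append(processor(row))
    return dict(dataset)


# Made-up index content, standing in for an author's content list.
INDEX = """kind,title,url
blog,Example post,https://example.com/post
book,Example book,https://example.com/book
video,Example talk,https://example.com/talk
"""

result = build(INDEX)
for kind, records in result.items():
    print(kind, len(records))
```

Keeping the processors in a lookup table means a new content type can be supported by adding one function and one table entry, without touching the main loop.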

Quick Start & Requirements

Install: Clone the repo, create a Python virtual environment (python3 -m venv venv), activate it (source venv/bin/activate or venv\Scripts\activate), and install dependencies (pip install -r requirements.txt).
Requirements: Python 3.x.
Usage: Run python build.py <author> (e.g., python build.py virtual_adrianco).
More Info: meGPT GitHub

Highlighted Details

Supports ingestion of diverse content types: books (PDF), blog posts (text), Twitter/Mastodon archives, code, presentations (images), podcasts (audio), and videos.
Includes scripts for extracting content from Medium and Blogger archives, and processing Twitter conversations.
Aims to facilitate RAG-based LLM persona development, with links to relevant presentations and experiments.
YouTube download scripts attempt multiple methods to bypass bot detection, though manual downloads may sometimes be necessary.
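The RAG-based approach mentioned above can be sketched in a few lines: pick the content chunks most relevant to a question, then place them in the prompt so the model answers from the author's own material. This is only an illustration under stated assumptions. The word-overlap scoring stands in for the embedding similarity a real system would likely use, and the function names, prompt wording, and sample excerpts are all made up.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, chunk: str) -> int:
    """Count how often the query's words occur in the chunk."""
    query_words = set(tokenize(query))
    counts = Counter(tokenize(chunk))
    return sum(counts[word] for word in query_words)


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return up to k chunks with a non-zero score, best first."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return [c for c in ranked[:k] if score(query, c) > 0]


def build_prompt(author: str, question: str, chunks: List[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the answer in retrieved excerpts."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        f"Answer as {author}, using only the excerpts below.\n"
        f"Excerpts:\n{context}\n"
        f"Question: {question}"
    )


# Made-up excerpts standing in for preprocessed author content.
CHUNKS = [
    "Microservices let teams deploy independently.",
    "Chaos engineering tests how systems fail.",
    "Cloud migration reduces data center costs.",
]

prompt = build_prompt("the author", "How do teams deploy microservices?", CHUNKS, k=1)
print(prompt)
```

The resulting prompt would then be sent to whichever LLM is in use; that call is omitted here because the source does not specify one.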

Maintenance & Community

The project was initiated and is maintained by Adrian Cockcroft (adrianco). Development is driven primarily by AI tools (ChatGPT 4, and Claude Sonnet 3.7 in Cursor) with minimal manual code edits. Community contributions are encouraged via pull requests for content inclusion.

Licensing & Compatibility

Creative Commons - Attribution Share-Alike. Explicit permission is granted for use as training data to develop the meGPT concept. Free for use by any author/speaker/expert to create a chatbot that answers questions as if it were the author, referencing published content.

Limitations & Caveats

The project relies heavily on AI-generated code, which may require debugging or refinement by experienced Python developers. YouTube video downloading can be challenging due to sophisticated bot detection measures, potentially requiring manual intervention. Transcript quality from YouTube videos is noted as suboptimal.
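The "try several methods, then fall back to a manual download" behaviour described above follows a common pattern, sketched below. The download methods are stubs with hypothetical names; they do not reflect the repository's actual scripts or any real downloader's API, and the URL and file path are placeholders.

```python
from typing import Callable, List, Optional, Tuple


class DownloadError(Exception):
    """Raised by a download method that could not fetch the video."""


def try_methods(
    url: str, methods: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[Optional[str], List[str]]:
    """Try each method in order; return (saved path or None, failure messages)."""
    failures: List[str] = []
    for name, method in methods:
        try:
            return method(url), failures
        except DownloadError as exc:
            # Record why this method failed and move on to the next one.
            failures.append(f"{name}: {exc}")
    return None, failures


# Stub methods for illustration only.
def blocked_method(url: str) -> str:
    raise DownloadError("blocked by bot detection")


def working_method(url: str) -> str:
    return "downloads/video.mp4"


path, failures = try_methods(
    "https://example.com/watch",
    [("method_a", blocked_method), ("method_b", working_method)],
)

if path is None:
    print("All methods failed; download the file manually.")
    for message in failures:
        print("  ", message)
else:
    print("Saved to", path)
```

Collecting the failure messages instead of discarding them makes it easier to see which methods were blocked when a manual download does become necessary.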
