LLM persona creation from personal content
This repository provides a framework for creating personalized LLM "personas" based on an individual's public content, enabling chatbots that can answer questions and generate summaries in the author's unique voice. It's designed for authors, speakers, and experts looking to leverage their body of work for AI-driven interactions, offering a low-friction approach to building a digital twin.
How It Works
The project takes a data-driven approach: an author's public content (books, blog posts, tweets, videos, etc.) is organized and processed by specialized scripts. A central build.py script orchestrates the ingestion and preprocessing of the various content types, creating a structured dataset. This dataset then serves as the foundation for training or fine-tuning an LLM, enabling it to emulate the author's style and knowledge. The use of AI-generated code with explicit context comments streamlines development.
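The orchestration described above can be pictured as a dispatcher that reads an index of an author's content and hands each entry to a handler for its content type. The following is a minimal sketch of that idea, not the repository's actual build.py: the index file name (content.csv), its columns (kind, title, url), and the handler functions are all hypothetical, chosen only for illustration.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical handlers: each turns one index row into a normalized record.
# A real pipeline would download and preprocess the content at this step.
def process_blog(row: Dict[str, str]) -> Dict[str, str]:
    return {"kind": "blog", "title": row["title"], "source": row["url"]}


def process_youtube(row: Dict[str, str]) -> Dict[str, str]:
    return {"kind": "youtube", "title": row["title"], "source": row["url"]}


# Map each content type to the handler that knows how to process it.
HANDLERS: Dict[str, Callable[[Dict[str, str]], Dict[str, str]]] = {
    "blog": process_blog,
    "youtube": process_youtube,
}


def build(author_dir: Path) -> List[Dict[str, str]]:
    """Read the author's content index and dispatch each row by content type."""
    records: List[Dict[str, str]] = []
    index_path = author_dir / "content.csv"  # assumed file name
    with index_path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            handler = HANDLERS.get(row["kind"])
            if handler is None:
                # Unknown content types are skipped rather than failing the build.
                print(f"skipping unsupported kind: {row['kind']}")
                continue
            records.append(handler(row))
    return records
```

Keeping the handlers in a lookup table means a new content type can be supported by adding one function and one table entry, which suits a project where contributors add content sources incrementally.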
Quick Start & Requirements
Create a virtual environment (python3 -m venv venv), activate it (source venv/bin/activate on Linux/macOS or venv\Scripts\activate on Windows), and install dependencies (pip install -r requirements.txt).
Run python build.py <author> (e.g., python build.py virtual_adrianco).
Highlighted Details
Maintenance & Community
The project was initiated and is maintained by Adrian Cockcroft. Development is primarily driven by AI (ChatGPT 4, and Claude Sonnet 3.7 in Cursor) with minimal manual code edits. Community contributions are encouraged via pull requests for content inclusion.
Licensing & Compatibility
Licensed under Creative Commons Attribution-ShareAlike. Explicit permission is granted for use as training data to develop the meGPT concept. Any author, speaker, or expert may use it freely to create a chatbot that answers questions as if it were the author, referencing published content.
Limitations & Caveats
The project relies heavily on AI-generated code, which may require debugging or refinement by experienced Python developers. YouTube video downloading can be challenging due to sophisticated bot detection measures, potentially requiring manual intervention. Transcript quality from YouTube videos is noted as suboptimal.