TimeCapsuleLLM by haykgrigo3

LLM trained on historical texts to reduce modern bias

Created 6 months ago
832 stars

Top 42.7% on SourcePulse

View on GitHub
Project Summary

This project aims to create a language model trained exclusively on historical texts, minimizing modern bias and simulating the worldview of specific eras. It is aimed at researchers and developers interested in historical AI or in bias-reduction techniques, and it offers a novel approach to temporal context in LLMs.

How It Works

The project leverages Andrej Karpathy's nanoGPT architecture and core training scripts. The key innovation is "Selective Temporal Training" (STT): instead of fine-tuning a pre-trained model, which retains that model's inherent biases, models are trained from scratch on curated datasets restricted to a specific historical period. The goal is models that genuinely reflect the language and knowledge of their training era, with no modern concepts leaking in as anachronistic hallucinations.
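
The repository's exact curation pipeline is not shown here, but the heart of STT is a hard date filter over the corpus before any training happens. Below is a minimal sketch, assuming each source document carries publication-year metadata; the Document structure and field names are illustrative, not from the repository.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    title: str
    year: int  # publication year, taken from source metadata
    text: str

def select_temporal_slice(docs: List[Document], start: int, end: int) -> List[Document]:
    """Keep only documents published within the target era.

    Training from scratch on this slice (rather than fine-tuning a
    pre-trained model) means the model never sees post-cutoff text.
    """
    return [d for d in docs if start <= d.year <= end]

corpus = [
    Document("A Tale of Two Cities", 1859, "It was the best of times..."),
    Document("Dracula", 1897, "..."),  # outside the V0.5 window
    Document("Pride and Prejudice", 1813, "It is a truth universally acknowledged..."),
]

# the 1800-1875 window used for the V0.5 model described below
print([d.title for d in select_temporal_slice(corpus, 1800, 1875)])
# -> ['A Tale of Two Cities', 'Pride and Prejudice']
```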

Quick Start & Requirements
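
The repository builds on nanoGPT, so getting started presumably follows nanoGPT's usual prepare/train/sample workflow. Below is a hypothetical sketch of the prepare step in nanoGPT's character-level style; the file names and the 90/10 split mirror nanoGPT's own examples, and the idea of "input.txt" holding the curated era text is an assumption, not documented by this repo.

```python
import numpy as np

# Hypothetical prepare step: turn a curated 1800-1875 text file into
# the token-id binaries that nanoGPT's train.py memory-maps during training.
with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Character-level vocabulary built from the corpus itself, so the
# token set contains only characters that actually appear in era text.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

# 90/10 train/validation split, stored as uint16 ids.
n = len(data)
np.array(encode(data[: int(n * 0.9)]), dtype=np.uint16).tofile("train.bin")
np.array(encode(data[int(n * 0.9) :]), dtype=np.uint16).tofile("val.bin")
```

From there, training and sampling would use nanoGPT's train.py and sample.py as documented upstream.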

Highlighted Details

  • The V0.5 model (trained on 1800-1875 data) exhibits Victorian writing style and grammar, though its output generally lacks coherence from one sentence to the next.
  • Early models demonstrate an inability to recognize modern concepts and vocabulary, adhering strictly to the training data's temporal constraints.
  • The project emphasizes data curation and cleaning as critical steps, with ongoing efforts to improve robustness against scan artifacts like "Digitized by Google" footers (see the cleaning sketch after this list).
  • Future plans include training on significantly larger corpora (5-10x) to explore whether reasoning capabilities can emerge from era-restricted data alone.
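
The repository's cleaning code is not shown here; below is a minimal sketch of stripping scan artifacts such as the "Digitized by Google" footer cited above. That footer string comes from the project; the other patterns and helper names are illustrative assumptions.

```python
import re

# Patterns for scan/OCR artifacts in digitized 19th-century books.
# "Digitized by Google" is cited by the project; the rest are assumptions.
ARTIFACT_PATTERNS = [
    re.compile(r"Digitized by Google", re.IGNORECASE),
    re.compile(r"^\s*\d+\s*$", re.MULTILINE),  # bare page-number lines
]

def clean_page(text: str) -> str:
    """Strip known scan artifacts before the text enters the training corpus."""
    for pat in ARTIFACT_PATTERNS:
        text = pat.sub(" ", text)
    # collapse the whitespace left behind by the removals
    return re.sub(r"\s{2,}", " ", text).strip()

page = "CHAPTER I.\n\n 42 \nIt was a dark evening.\nDigitized by Google"
print(clean_page(page))  # -> "CHAPTER I. It was a dark evening."
```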

Maintenance & Community

The project appears to be a personal experimental effort by haykgrigo3, with updates posted periodically to the repository. No community channels such as Discord or Slack are linked.

Licensing & Compatibility

The repository does not explicitly state a license. nanoGPT's MIT license governs the code inherited from it, but the repository's own additions (data curation, training methodology, trained models) carry no stated grant, so rights to commercial use or derivative works are legally unclear by default.

Limitations & Caveats

Current models (e.g., V0.5) are described as "sentence generators" rather than fully capable LLMs, exhibiting significant factual hallucinations and a lack of complex reasoning. The dataset size for early models is small (187MB to 500MB), limiting output coherence and complexity. Data cleaning remains an ongoing challenge.

Health Check

Last Commit: 5 days ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1

Star History

157 stars in the last 30 days

Explore Similar Projects

Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrew Kane (Author of pgvector), and 8 more.

awesome-nlp by keon

Curated list of NLP resources
Top 0.1% on SourcePulse
18k stars
Created 10 years ago
Updated 1 week ago