TimeCapsuleLLM by haykgrigo3

LLM trained on historical texts to reduce modern bias

created 1 month ago
269 stars

Top 96.2% on sourcepulse

View on GitHub
Project Summary

This project aims to create a language model trained exclusively on historical texts, minimizing modern bias and simulating the worldview of a specific era. It targets researchers and developers interested in historical AI or in bias-reduction techniques, offering a novel approach to temporal context in LLMs.

How It Works

The project leverages Andrej Karpathy's nanoGPT architecture and core training scripts. The key innovation is "Selective Temporal Training" (STT): models are trained from scratch on curated datasets limited to a specific historical period, in contrast to fine-tuning, which inherits the biases of a pre-trained model. The goal is to produce models that genuinely reflect the language and knowledge of their training era, free of modern concepts and anachronistic hallucinations.
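
To make the STT curation step concrete, here is a minimal sketch of filtering a corpus down to a target era before any tokenization or training. The JSON layout and the "year" and "text" fields are assumptions for illustration; the repository does not document a schema.

```python
import json
from pathlib import Path

# Hypothetical corpus layout: one JSON file per digitized document,
# each carrying a publication "year" and the raw "text".
ERA_START, ERA_END = 1800, 1875

def curate_era(corpus_dir: str, out_path: str) -> int:
    """Write the concatenated text of all in-era documents to out_path."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for doc_file in Path(corpus_dir).glob("*.json"):
            doc = json.loads(doc_file.read_text(encoding="utf-8"))
            if ERA_START <= doc.get("year", 0) <= ERA_END:
                out.write(doc["text"] + "\n\n")
                kept += 1
    return kept

if __name__ == "__main__":
    n = curate_era("corpus/", "data/era_1800_1875.txt")
    print(f"kept {n} documents from {ERA_START}-{ERA_END}")
```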

Quick Start & Requirements
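
The summary does not list explicit setup steps. Since the project builds on nanoGPT, a from-scratch run presumably follows nanoGPT's usual prepare/train/sample workflow. As a sketch, the prepare step encodes the curated era corpus into the binary token files nanoGPT's train.py reads; the character-level encoding below, in the style of nanoGPT's data/shakespeare_char/prepare.py, is one simple choice, and all paths are illustrative (the project's actual tokenizer is not stated).

```python
import pickle

import numpy as np

# Input is the curated corpus from the earlier sketch.
text = open("data/era_1800_1875.txt", encoding="utf-8").read()

# Build a character-level vocabulary and encode the corpus.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# 90/10 train/val split, written in the binary format train.py expects.
split = int(0.9 * len(ids))
ids[:split].tofile("data/train.bin")
ids[split:].tofile("data/val.bin")

# nanoGPT's char-level configs look for a meta.pkl with the vocab.
with open("data/meta.pkl", "wb") as f:
    pickle.dump({"vocab_size": len(chars),
                 "itos": {i: ch for ch, i in stoi.items()},
                 "stoi": stoi}, f)
```

From there, training and sampling would use nanoGPT's own entry points (train.py and sample.py); consult that repository for configuration flags and hardware requirements.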

Highlighted Details

  • The V0.5 model (trained on 1800-1875 data) exhibits Victorian writing style and grammar, though its sentences show little coherence with one another.
  • Early models do not recognize modern concepts and vocabulary, adhering strictly to the temporal bounds of their training data.
  • The project treats data curation and cleaning as critical steps, with ongoing work to strip scan artifacts such as "Digitized by Google" footers (see the sketch after this list).
  • Future plans include training on significantly larger corpora (5-10x) to explore whether reasoning capabilities can emerge from temporally bounded data alone.
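
As a sketch of the kind of cleaning involved, scan-artifact lines can be dropped with a few regular expressions before curation. The patterns below are illustrative assumptions, not the project's actual rules.

```python
import re

# Illustrative patterns for common digitization artifacts.
ARTIFACT_PATTERNS = [
    re.compile(r"digitized by google", re.IGNORECASE),
    re.compile(r"^\s*\d+\s*$"),            # bare page numbers
    re.compile(r"public domain.*google", re.IGNORECASE),
]

def strip_artifacts(text: str) -> str:
    """Drop any line that matches a known scan-artifact pattern."""
    keep = [line for line in text.splitlines()
            if not any(p.search(line) for p in ARTIFACT_PATTERNS)]
    return "\n".join(keep)

print(strip_artifacts("A fine day.\nDigitized by Google\n42\nIndeed."))
# -> "A fine day.\nIndeed."
```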

Maintenance & Community

The project appears to be a personal experimental effort by haykgrigo3, with updates posted periodically to the repository. No community channels such as Discord or Slack are linked.

Licensing & Compatibility

The repository does not explicitly state a license. Given its reliance on nanoGPT, users should consult nanoGPT's license (MIT) for guidance, but the specific data curation and training methodology here may have different implications for commercial use or derivative works.

Limitations & Caveats

Current models (e.g., V0.5) are described as "sentence generators" rather than fully capable LLMs, exhibiting significant factual hallucinations and a lack of complex reasoning. The dataset size for early models is small (187MB to 500MB), limiting output coherence and complexity. Data cleaning remains an ongoing challenge.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 271 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

mlx-gpt2 by pranavjad

Minimal GPT-2 implementation for educational purposes

Top 0.5% · 393 stars
created 1 year ago · updated 1 year ago