TimeCapsuleLLM by haykgrigo3

LLM trained on historical texts to reduce modern bias

Created 6 months ago
832 stars

Top 42.7% on SourcePulse

View on GitHub
Project Summary

This project aims to create a language model trained exclusively on historical texts, minimizing modern bias and simulating the worldview of specific eras. It is aimed at researchers and developers interested in historical AI or in bias-reduction techniques, and it offers a novel approach to temporal context in LLMs.

How It Works

The project leverages Andrej Karpathy's nanoGPT architecture and core training scripts. The key innovation is "Selective Temporal Training" (STT): instead of fine-tuning a pre-trained model, which retains that model's inherent biases, models are trained from scratch on curated datasets restricted to a specific historical period. The goal is models that genuinely reflect the language and knowledge of their training era, with no modern concepts leaking in as anachronistic hallucinations.
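
The repository's exact curation pipeline is not shown here, but the heart of STT is a hard date filter over the corpus before any training happens. Below is a minimal sketch, assuming each source document carries publication-year metadata; the Document structure and field names are illustrative, not from the repository.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    title: str
    year: int  # publication year, taken from source metadata
    text: str

def select_temporal_slice(docs: List[Document], start: int, end: int) -> List[Document]:
    """Keep only documents published within the target era.

    Training from scratch on this slice (rather than fine-tuning a
    pre-trained model) means the model never sees post-cutoff text.
    """
    return [d for d in docs if start <= d.year <= end]

corpus = [
    Document("A Tale of Two Cities", 1859, "It was the best of times..."),
    Document("Dracula", 1897, "..."),  # outside the V0.5 window
    Document("Pride and Prejudice", 1813, "It is a truth universally acknowledged..."),
]

# the 1800-1875 window used for the V0.5 model described below
print([d.title for d in select_temporal_slice(corpus, 1800, 1875)])
# -> ['A Tale of Two Cities', 'Pride and Prejudice']
```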

Quick Start & Requirements
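
The repository builds on nanoGPT, so getting started presumably follows nanoGPT's usual prepare/train/sample workflow. Below is a hypothetical sketch of the prepare step in nanoGPT's character-level style; the file names and the 90/10 split mirror nanoGPT's own examples, and the idea of "input.txt" holding the curated era text is an assumption, not documented by this repo.

```python
import numpy as np

# Hypothetical prepare step: turn a curated 1800-1875 text file into
# the token-id binaries that nanoGPT's train.py memory-maps during training.
with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Character-level vocabulary built from the corpus itself, so the
# token set contains only characters that actually appear in era text.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

# 90/10 train/validation split, stored as uint16 ids.
n = len(data)
np.array(encode(data[: int(n * 0.9)]), dtype=np.uint16).tofile("train.bin")
np.array(encode(data[int(n * 0.9) :]), dtype=np.uint16).tofile("val.bin")
```

From there, training and sampling would use nanoGPT's train.py and sample.py as documented upstream.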

Highlighted Details

  • The V0.5 model (trained on 1800-1875 data) exhibits Victorian writing style and grammar, though its output generally lacks coherence from one sentence to the next.
  • Early models demonstrate an inability to recognize modern concepts and vocabulary, adhering strictly to the training data's temporal constraints.
  • The project emphasizes data curation and cleaning as critical steps, with ongoing efforts to improve robustness against scan artifacts like "Digitized by Google" footers (see the cleaning sketch after this list).
  • Future plans include training on significantly larger corpora (5-10x) to explore whether reasoning capabilities can emerge from era-restricted data alone.
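
The repository's cleaning code is not shown here; below is a minimal sketch of stripping scan artifacts such as the "Digitized by Google" footer cited above. That footer string comes from the project; the other patterns and helper names are illustrative assumptions.

```python
import re

# Patterns for scan/OCR artifacts in digitized 19th-century books.
# "Digitized by Google" is cited by the project; the rest are assumptions.
ARTIFACT_PATTERNS = [
    re.compile(r"Digitized by Google", re.IGNORECASE),
    re.compile(r"^\s*\d+\s*$", re.MULTILINE),  # bare page-number lines
]

def clean_page(text: str) -> str:
    """Strip known scan artifacts before the text enters the training corpus."""
    for pat in ARTIFACT_PATTERNS:
        text = pat.sub(" ", text)
    # collapse the whitespace left behind by the removals
    return re.sub(r"\s{2,}", " ", text).strip()

page = "CHAPTER I.\n\n 42 \nIt was a dark evening.\nDigitized by Google"
print(clean_page(page))  # -> "CHAPTER I. It was a dark evening."
```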

Maintenance & Community

The project appears to be a personal experimental effort by haykgrigo3, with updates posted periodically to the repository. No community channels such as Discord or Slack are linked.

Licensing & Compatibility

The repository does not explicitly state a license. nanoGPT's MIT license governs the code inherited from it, but the repository's own additions (data curation, training methodology, trained models) carry no stated grant, so rights to commercial use or derivative works are legally unclear by default.

Limitations & Caveats

Current models (e.g., V0.5) are described as "sentence generators" rather than fully capable LLMs, exhibiting significant factual hallucinations and a lack of complex reasoning. The dataset size for early models is small (187MB to 500MB), limiting output coherence and complexity. Data cleaning remains an ongoing challenge.

Health Check

Last Commit: 5 days ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1

Star History

157 stars in the last 30 days

Explore Similar Projects

Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrew Kane (Author of pgvector), and 8 more.

awesome-nlp by keon

Curated list of NLP resources
Top 0.1% on SourcePulse
18k stars
Created 10 years ago
Updated 1 week ago