LLM trained on historical texts to reduce modern bias
This project aims to create a language model trained exclusively on historical texts to minimize modern bias and simulate the worldview of specific eras. It's targeted at researchers and developers interested in historical AI or exploring bias reduction techniques, offering a novel approach to temporal context in LLMs.
How It Works
The project leverages Andrej Karpathy's nanoGPT architecture and core training scripts. The key innovation is "Selective Temporal Training" (STT), which involves training models from scratch on curated datasets limited to specific historical time periods. This contrasts with fine-tuning, which retains inherent biases from pre-trained models. The goal is to produce models that genuinely reflect the language and knowledge of their training era, avoiding modern concepts and potential hallucinations.
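As a sketch of what STT-style data curation might look like in practice, the snippet below filters a collection of texts by publication year and concatenates the survivors into a single nanoGPT-style training corpus. The metadata file, field names, and era range here are illustrative assumptions, not the repository's actual pipeline.

```python
# Hypothetical sketch of Selective Temporal Training data curation:
# keep only documents whose publication year falls inside the target era,
# then concatenate them into one corpus file for training from scratch.
import json
from pathlib import Path

ERA = (1800, 1875)  # assumed inclusive year range for the target period

def in_era(year: int, era: tuple[int, int]) -> bool:
    return era[0] <= year <= era[1]

def build_corpus(metadata_file: Path, out_file: Path) -> None:
    """Concatenate every text whose 'year' metadata falls inside ERA."""
    with open(metadata_file, encoding="utf-8") as f:
        records = json.load(f)  # e.g. [{"path": "texts/a.txt", "year": 1834}, ...]
    with open(out_file, "w", encoding="utf-8") as out:
        for rec in records:
            if in_era(rec["year"], ERA):
                out.write(Path(rec["path"]).read_text(encoding="utf-8"))
                out.write("\n\n")

if __name__ == "__main__":
    build_corpus(Path("metadata.json"), Path("input.txt"))
```

Because the model is trained from scratch on this filtered corpus, nothing outside the era can enter its weights, which is the core distinction from fine-tuning noted above.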
Quick Start & Requirements
A custom tokenizer is required (vocab.json, merges.txt). Training was performed on a GeForce RTX 4060 GPU with an i5-13400F CPU and 16GB DDR5 RAM. Users must collect .txt files from public domain sources within a chosen time period; a script, download_texts_improved.py, is provided for this.
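For context, vocab.json and merges.txt are the standard output files of a byte-level BPE tokenizer. Below is a minimal sketch of producing and reloading such a pair with the Hugging Face tokenizers library; the corpus filename and vocabulary size are assumptions, and the repository's exact tokenizer-training procedure may differ.

```python
# Sketch: train a byte-level BPE tokenizer on the era-limited corpus,
# producing the vocab.json / merges.txt pair mentioned above.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["input.txt"], vocab_size=16_000, min_frequency=2)
tokenizer.save_model(".")  # writes vocab.json and merges.txt

# Reload the saved tokenizer and encode a sample sentence.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
ids = tokenizer.encode("The omnibus departed from the Strand at noon.").ids
print(ids)
```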
Maintenance & Community
The project appears to be a personal experimental effort by haykgrigo3. Updates are posted periodically on the repository. No community channels such as Discord or Slack are mentioned.
Licensing & Compatibility
The repository does not explicitly state a license. Given its reliance on nanoGPT, users should consult nanoGPT's license (MIT) for guidance, but the specific data curation and training methodology here may have different implications for commercial use or derivative works.
Limitations & Caveats
Current models (e.g., V0.5) are described as "sentence generators" rather than fully capable LLMs, exhibiting significant factual hallucinations and a lack of complex reasoning. The dataset size for early models is small (187MB to 500MB), limiting output coherence and complexity. Data cleaning remains an ongoing challenge.
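To illustrate the kind of cleaning involved, the sketch below strips Project Gutenberg license boilerplate, a common step when building corpora from public domain sources, since that boilerplate would otherwise leak modern English into the historical training data. The marker regexes follow Gutenberg's standard conventions; the repository's actual cleaning pipeline is not documented here and may differ.

```python
# Minimal sketch of one common cleaning step for public domain sources:
# keep only the body text between Project Gutenberg's start/end markers.
import re

START_RE = re.compile(r"\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*")
END_RE = re.compile(r"\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*")

def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the body text between the Gutenberg start/end markers."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end:
        return text[start.end():end.start()].strip()
    return text  # no markers found; return the text unchanged
```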