history-llms by DGoettlich

Historical LLMs trained on time-locked data

Created 4 weeks ago

1,337 stars

Top 29.8% on SourcePulse

Project Summary

This project trains large language models (LLMs) exclusively on historical, time-stamped data to serve as research tools for humanities and social sciences. By creating "time-locked" models with specific knowledge cutoffs, it enables exploration of historical discourse patterns without modern hindsight contamination, offering a unique window into past thought.

How It Works

The core approach is to train LLMs from scratch on massive historical text corpora (80B-600B+ tokens) with defined knowledge cutoffs (e.g., 1913, 1929). This produces "time-locked" models that embody the textual culture of their era and avoid the hindsight contamination inherent in general-purpose LLMs. A key design goal is "uncontaminated bootstrapping": for scientific applications, interference with the normative judgments the models acquire during pretraining is kept to a minimum.
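The project's code and data are not yet released, so as a minimal sketch of the "time-locked" idea: building a training corpus means filtering a dated document collection so nothing after the cutoff leaks in. The field names (`text`, `published`) and the `time_lock` helper below are illustrative assumptions, not the project's actual pipeline.

```python
from datetime import date

def time_lock(corpus, cutoff):
    """Keep only documents published on or before the knowledge cutoff."""
    return [doc for doc in corpus if doc["published"] <= cutoff]

# Toy corpus of time-stamped documents (hypothetical records).
corpus = [
    {"text": "Treaty coverage ...", "published": date(1905, 6, 1)},
    {"text": "Market report ...",   "published": date(1928, 3, 15)},
    {"text": "Radio address ...",   "published": date(1935, 9, 2)},
]

# A 1913-cutoff model would only ever see the first document;
# a 1929-cutoff model would see the first two.
train_1913 = time_lock(corpus, date(1913, 12, 31))
train_1929 = time_lock(corpus, date(1929, 12, 31))
```

At the scale the project describes (hundreds of billions of tokens), the same cutoff filter would run over a streaming dataset rather than an in-memory list, but the invariant is identical: every training document must predate the model's cutoff year.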

Quick Start & Requirements

Artifacts, data, and repositories are slated for future public release. Installation, specific hardware requirements (e.g., GPU, CUDA), and setup procedures are not yet detailed. Links to official quick-start guides or demos are not provided in the README.

Highlighted Details

  • Upcoming Ranke-4B models (4B parameters, Qwen3 architecture) trained on 80B tokens up to various historical cutoffs.
  • Utilizes curated datasets of up to 600B time-stamped tokens.
  • Designed as "compressed representations of massive textual corpora" for discourse pattern analysis.
  • Focus on scientific applications with a "responsible access framework" for sensitive content.

Maintenance & Community

Developed by researchers at the University of Zurich and the University of Cologne, with research credits acknowledged from Lambda AI. Community input on research directions and access frameworks is invited via history-llms@econ.uzh.ch.

Licensing & Compatibility

The README does not specify a software license. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The models will reproduce historical biases (racism, misogyny, etc.) present in the training data; this is by design, since the models are intended for studying how such views were articulated. Access will be restricted to researchers through a scholarly framework to prevent misuse. The models represent published text, not public opinion, and are not substitutes for human interpretation. Artifacts are not yet publicly available.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 1,349 stars in the last 28 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Casper Hansen (author of AutoAWQ), and 8 more.
