Open language model code for training, evaluation, and inference
This repository provides the code for training, evaluating, and running the OLMo family, Allen AI's state-of-the-art open language models. It is designed for researchers and scientists who need to understand and replicate LLM training pipelines, offering full transparency and control over the process.
How It Works
OLMo employs a two-stage training process. Stage 1 involves training on massive web-based datasets (4-5 trillion tokens) for foundational language understanding. Stage 2 refines the models on smaller, high-quality datasets (50-300 billion tokens), often involving averaging weights from multiple training runs ("soups") to improve robustness and performance. This staged approach allows for both broad knowledge acquisition and targeted specialization.
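As a rough illustration of the weight-averaging ("souping") step, the sketch below averages the parameters of several checkpoints. It is a minimal example that assumes identically shaped state dicts, not the repository's own merging script, and the checkpoint paths are placeholders.

import torch

def average_checkpoints(checkpoint_paths):
    # Sum the parameter tensors across checkpoints, then divide by the number of runs.
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(checkpoint_paths) for k, v in avg_state.items()}

# Placeholder paths: checkpoints from independent runs of the same architecture.
souped = average_checkpoints(["run_a/model.pt", "run_b/model.pt", "run_c/model.pt"])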
Quick Start & Requirements
Install from source with pip install -e .[all], or from PyPI with pip install ai2-olmo.
Launch training with torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config}.
Hugging Face transformers library integration is available for inference.
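For inference, a minimal sketch using the transformers integration is shown below; the model ID is an example and should be checked against the OLMo checkpoints published on the Hugging Face Hub.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B-hf"  # example checkpoint name; verify on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Language modeling is ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))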
Highlighted Details
Maintenance & Community
The project is led by Allen AI. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. The provided BibTeX entry suggests a release in 2024. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The 32B model training requires the separate OLMo-core repository. Training configurations assume data is streamed over HTTP; for large-scale reproduction, downloading the data locally is recommended (see the sketch below). Training requires PyTorch 2.5.x.
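One possible way to localize the data is sketched below: it downloads each remote data file referenced in a training config and rewrites the config to point at the local copies. The data.paths key and the flat URL layout are assumptions about the config schema, so adapt the helper to the actual config files.

import pathlib
import urllib.request

import yaml

def localize_data_paths(config_path, dest_dir):
    # Download each remote data file once and rewrite the config to use local paths.
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    local_paths = []
    for url in cfg["data"]["paths"]:  # assumed key; check the real config schema
        if url.startswith(("http://", "https://")):
            local_file = dest / url.rsplit("/", 1)[-1]
            if not local_file.exists():
                urllib.request.urlretrieve(url, local_file)
            local_paths.append(str(local_file))
        else:
            local_paths.append(url)

    cfg["data"]["paths"] = local_paths
    with open(config_path, "w") as f:
        yaml.safe_dump(cfg, f)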