OLMo by allenai

Open language model code for training, evaluation, and inference

created 2 years ago
5,827 stars

Top 9.0% on sourcepulse

Project Summary

This repository provides the code for training, evaluating, and running OLMo, Allen AI's state-of-the-art open language models. It is designed for researchers and scientists who need to understand and replicate LLM training pipelines, offering full transparency and control over the process.

How It Works

OLMo employs a two-stage training process. Stage 1 trains on massive web-based datasets (4-5 trillion tokens) to build foundational language understanding. Stage 2 refines the models on smaller, high-quality datasets (50-300 billion tokens), often averaging the weights of several training runs into "soups" to improve robustness and performance. This staged approach allows for both broad knowledge acquisition and targeted specialization.
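
The "soup" step is essentially an element-wise average of model weights across runs. Below is a minimal sketch of that idea, assuming plain PyTorch state dicts; the file names and loading details are illustrative and do not reflect OLMo's actual checkpointing tooling.

    import torch

    def soup_checkpoints(paths):
        """Element-wise average of parameters from several runs ("model souping")."""
        soup = None
        for path in paths:
            state = torch.load(path, map_location="cpu")  # hypothetical checkpoint files
            if soup is None:
                soup = {name: p.clone().float() for name, p in state.items()}
            else:
                for name, p in state.items():
                    soup[name] += p.float()
        # Divide the accumulated sums by the number of runs to get the average.
        return {name: p / len(paths) for name, p in soup.items()}

    # averaged = soup_checkpoints(["run_a.pt", "run_b.pt", "run_c.pt"])
    # torch.save(averaged, "soup.pt")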

Quick Start & Requirements

  • Installation: pip install -e .[all] (from source) or pip install ai2-olmo (from PyPI).
  • Prerequisites: PyTorch (version 2.5.x required for training) and CUDA for GPU acceleration.
  • Training: torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config}.
  • Inference: Hugging Face transformers library integration is available (see the sketch after this list).
  • Resources: Training requires significant computational resources.
  • Docs: OLMo core (for 32B model training), OLMo Eval, olmes.
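
For inference through the Hugging Face transformers integration, loading a model typically looks like the sketch below. The model ID shown is only an example of a published OLMo checkpoint; check the Hugging Face hub for the exact names and sizes currently available.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "allenai/OLMo-2-1124-7B"  # example checkpoint ID; verify on the HF hub
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    # Generate a short continuation from a prompt.
    inputs = tokenizer("Language modeling is ", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))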

Highlighted Details

  • Offers pre-trained models in 7B, 13B, and 32B parameter sizes.
  • Supports both OLMo and Hugging Face checkpoint formats.
  • Includes scripts for reproducing training and deploying on Modal.com.
  • Provides instruction-tuned variants of the models.

Maintenance & Community

The project is led by Allen AI. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The provided BibTeX entry suggests a release in 2024. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The 32B model training requires the separate OLMo-core repository. Training configurations assume data is streamed over HTTP; downloading the data locally is recommended for large-scale reproduction. PyTorch version 2.5.x is required for training.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 14

Star History

290 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

open-r1 by huggingface

  • Top 0.2% · 25k stars
  • SDK for reproducing DeepSeek-R1
  • Created 6 months ago; updated 3 days ago