OLMo by allenai

Open language model code for training, evaluation, and inference

created 2 years ago
5,827 stars

Top 9.0% on sourcepulse

Project Summary

This repository provides the code for training, evaluating, and running OLMo, Allen AI's state-of-the-art open language models. It is designed for researchers and scientists who need to understand and replicate LLM training pipelines, offering full transparency and control over the process.

How It Works

OLMo employs a two-stage training process. Stage 1 trains on massive web-based datasets (4-5 trillion tokens) to build foundational language understanding. Stage 2 refines the models on smaller, high-quality datasets (50-300 billion tokens), often averaging the weights of several training runs into "soups" to improve robustness and performance. This staged approach allows for both broad knowledge acquisition and targeted specialization.
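
The "soup" step is essentially an element-wise average of model weights across runs. Below is a minimal sketch of that idea, assuming plain PyTorch state dicts; the file names and loading details are illustrative and do not reflect OLMo's actual checkpointing tooling.

    import torch

    def soup_checkpoints(paths):
        """Element-wise average of parameters from several runs ("model souping")."""
        soup = None
        for path in paths:
            state = torch.load(path, map_location="cpu")  # hypothetical checkpoint files
            if soup is None:
                soup = {name: p.clone().float() for name, p in state.items()}
            else:
                for name, p in state.items():
                    soup[name] += p.float()
        # Divide the accumulated sums by the number of runs to get the average.
        return {name: p / len(paths) for name, p in soup.items()}

    # averaged = soup_checkpoints(["run_a.pt", "run_b.pt", "run_c.pt"])
    # torch.save(averaged, "soup.pt")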

Quick Start & Requirements

  • Installation: pip install -e .[all] (from source) or pip install ai2-olmo (from PyPI).
  • Prerequisites: PyTorch (version 2.5.x required for training) and CUDA for GPU acceleration.
  • Training: torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config}.
  • Inference: Hugging Face transformers library integration is available (see the sketch after this list).
  • Resources: Training requires significant computational resources.
  • Docs: OLMo core (for 32B model training), OLMo Eval, olmes.
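
For inference through the Hugging Face transformers integration, loading a model typically looks like the sketch below. The model ID shown is only an example of a published OLMo checkpoint; check the Hugging Face hub for the exact names and sizes currently available.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "allenai/OLMo-2-1124-7B"  # example checkpoint ID; verify on the HF hub
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    # Generate a short continuation from a prompt.
    inputs = tokenizer("Language modeling is ", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))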

Highlighted Details

  • Offers pre-trained models in 7B, 13B, and 32B parameter sizes.
  • Supports both OLMo and Hugging Face checkpoint formats.
  • Includes scripts for reproducing training and deploying on Modal.com.
  • Provides instruction-tuned variants of the models.

Maintenance & Community

The project is led by Allen AI. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The provided BibTeX entry suggests a release in 2024. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The 32B model training requires the separate OLMo-core repository. Training configurations assume data is streamed over HTTP; downloading the data locally is recommended for large-scale reproduction. PyTorch version 2.5.x is required for training.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 14

Star History

290 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

open-r1 by huggingface

  • Top 0.2% · 25k stars
  • SDK for reproducing DeepSeek-R1
  • Created 6 months ago; updated 3 days ago