OLMo by allenai

Open language model code for training, evaluation, and inference

Created 2 years ago
5,989 stars

Top 8.6% on SourcePulse

Project Summary

This repository provides the code for training, evaluating, and running OLMo, Allen AI's state-of-the-art open language models. It is designed for researchers and scientists who need to understand and replicate LLM training pipelines, offering full transparency and control over the process.

How It Works

OLMo employs a two-stage training process. Stage 1 involves training on massive web-based datasets (4-5 trillion tokens) for foundational language understanding. Stage 2 refines the models on smaller, high-quality datasets (50-300 billion tokens), often involving averaging weights from multiple training runs ("soups") to improve robustness and performance. This staged approach allows for both broad knowledge acquisition and targeted specialization.
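
The weight-averaging ("souping") step can be pictured with a minimal PyTorch sketch. This is not the repository's own tooling, only the underlying idea; the checkpoint paths are hypothetical and the sketch assumes all runs share identical parameter keys.

```python
# Minimal sketch of checkpoint "souping" (weight averaging) across training runs.
# Paths are hypothetical; all checkpoints are assumed to have identical keys/shapes.
import torch

checkpoint_paths = ["run_a/model.pt", "run_b/model.pt"]  # hypothetical checkpoints
state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]

# Average each parameter tensor element-wise across the runs.
souped = {
    key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    for key in state_dicts[0]
}

torch.save(souped, "souped_model.pt")
```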

Quick Start & Requirements

  • Installation: pip install -e .[all] (from source) or pip install ai2-olmo (from PyPI).
  • Prerequisites: PyTorch (version 2.5.x recommended for training). CUDA is required for GPU acceleration.
  • Training: torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config}.
  • Inference: Hugging Face transformers library integration is available; see the sketch after this list.
  • Resources: Training requires significant computational resources.
  • Docs: OLMo core (for 32B model training), OLMo Eval, olmes.
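
For inference, a minimal sketch using the Hugging Face transformers integration is shown below. The model ID is an assumption (one of the published OLMo 2 checkpoints); substitute whichever OLMo variant, device placement, and transformers version you actually use.

```python
# Minimal inference sketch via Hugging Face transformers.
# The model ID below is an assumption; swap in the OLMo checkpoint you need.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Language modeling is ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```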

Highlighted Details

  • Offers pre-trained models in 7B, 13B, and 32B parameter sizes.
  • Supports both OLMo and Hugging Face checkpoint formats.
  • Includes scripts for reproducing training and deploying on Modal.com.
  • Provides instruction-tuned variants of the models (a chat-style usage sketch follows this list).
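
For the instruction-tuned variants, a hedged chat-style sketch: the model ID and prompt are assumptions, and the conversation is formatted with the tokenizer's generic apply_chat_template helper rather than any repo-specific tooling.

```python
# Sketch of chat-style prompting for an instruction-tuned OLMo variant.
# The model ID is an assumption; check the Hugging Face hub for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B-Instruct"  # assumed instruct checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize what OLMo is in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=80)

# Strip the prompt tokens and print only the generated reply.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```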

Maintenance & Community

The project is led by the Allen Institute for AI (Ai2). The README does not explicitly provide further community engagement details.

Licensing & Compatibility

The README summarized here does not explicitly state a license; the provided BibTeX entry points to a 2024 publication. Suitability for commercial use or closed-source integration is therefore not specified in this summary.

Limitations & Caveats

Training the 32B model requires the separate OLMo-core repository. The provided training configurations stream data over HTTP, so downloading the data locally is recommended for large-scale reproduction. Training targets PyTorch 2.5.x, as noted in the prerequisites.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 4
  • Issues (30d): 8
  • Star History: 90 stars in the last 30 days
