open_lm by mlfoundations

Language model research repo for medium-sized models (up to 7B params)

created 1 year ago
506 stars

Top 62.4% on sourcepulse

Project Summary

OpenLM is a PyTorch-based language modeling repository designed for efficient research on medium-sized models (up to 7B parameters). It offers a minimal dependency set, focusing on PyTorch, XFormers, and Triton, making it accessible for researchers and practitioners looking to train or fine-tune LMs without the overhead of larger frameworks.

How It Works

OpenLM utilizes a modular design centered around PyTorch, allowing for flexible integration of performance-enhancing libraries like XFormers and Triton. The training pipeline supports distributed computation via torchrun and handles data preprocessing and loading through the webdataset package. This approach prioritizes core LM functionality and performance, enabling researchers to experiment with various model sizes and training configurations efficiently.
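
As a rough sketch of the webdataset-based loading path, the snippet below streams pre-tokenized shards into fixed-length batches. The shard URL pattern and the assumption that each sample stores equal-length token sequences under a json key are illustrative only, not the repository's actual shard layout.

```python
# Minimal sketch of streaming pre-tokenized shards with webdataset.
# The shard pattern and the "json" key holding equal-length token sequences
# are assumptions for illustration; open_lm's actual shard layout may differ.
import torch
import webdataset as wds

SHARDS = "data/shards/shard-{000000..000099}.tar"  # hypothetical shard pattern


def to_tensor(sample):
    # sample is a 1-tuple containing the decoded JSON payload (a list of token IDs)
    return torch.tensor(sample[0], dtype=torch.long)


dataset = (
    wds.WebDataset(SHARDS)
    .shuffle(1000)        # shuffle within an in-memory buffer
    .decode()             # decode .json entries into Python objects
    .to_tuple("json")     # keep only the token payload
    .map(to_tensor)
)

loader = wds.WebLoader(dataset, batch_size=8, num_workers=4)

for batch in loader:
    # (8, sequence_length) tensor of token IDs, ready for a language-model step
    print(batch.shape)
    break
```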

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install --editable .
  • Prerequisites: Python >= 3.9, PyTorch, XFormers, Triton.
  • Data Preprocessing: Requires downloading and tokenizing data using the provided scripts (wiki_download.py, make_2048.py); a minimal packing sketch follows this list.
  • Training: Uses torchrun for distributed training; an example command is provided in the README.
  • Evaluation: Requires llm-foundry (pip install llm-foundry).
  • Links: Quickstart, Pretrained Models
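
The preprocessing step referenced above amounts to tokenizing raw text and packing it into fixed-length 2048-token sequences (what make_2048.py produces). The sketch below shows the idea; the tokenizer choice (EleutherAI/gpt-neox-20b) and the EOS-separation scheme are assumptions, not a drop-in replacement for the repository's scripts.

```python
# Minimal sketch: tokenize raw text and pack it into 2048-token sequences.
# The tokenizer and EOS handling are assumptions for illustration.
from transformers import AutoTokenizer

SEQ_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # assumed tokenizer


def pack_documents(documents):
    """Concatenate tokenized documents and yield SEQ_LEN-sized chunks."""
    buffer = []
    for text in documents:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # separate documents with EOS
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]
            buffer = buffer[SEQ_LEN:]


docs = ["First example document.", "Second example document."]
chunks = list(pack_documents(docs))
print(len(chunks))  # 0 for such short inputs; real corpora yield many chunks
```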

Highlighted Details

  • Supports model sizes up to 7B parameters and training runs on 250+ GPUs.
  • Includes pretrained OpenLM-1B and OpenLM-7B models.
  • Offers detailed performance benchmarks against other models like LLaMA and MPT.
  • Provides scripts for data preprocessing, training, evaluation, and text generation.

Maintenance & Community

Developed by researchers from multiple institutions including RAIVN Lab (University of Washington), UWNLP, Toyota Research Institute, and Columbia University. Code is based on open-clip and open-flamingo. Stability.ai provided resource support.

Licensing & Compatibility

The README does not explicitly state a license, so licensing should be clarified before commercial use or closed-source linking.

Limitations & Caveats

The README notes that the OpenLM-7B model is still in training, with a checkpoint released at 1.25T tokens. It also notes that the pretrained OpenLM-1B model uses the head_rotary positional embedding type, which must be set for compatibility with older training configurations.
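
To make the head_rotary caveat concrete, the sketch below shows a fail-fast check one might add before loading an OpenLM-1B checkpoint. The argument name positional_embedding_type is hypothetical and should be matched against the actual open_lm training configuration.

```python
# Hypothetical sketch only: the argument name below is an assumption, not
# open_lm's actual flag. It illustrates the caveat that the pretrained
# OpenLM-1B checkpoint expects head_rotary positional embeddings, so a
# mismatched setting should be caught before loading weights.
from argparse import Namespace


def check_positional_embedding(args: Namespace, checkpoint_name: str) -> None:
    """Fail fast if an OpenLM-1B checkpoint is paired with the wrong setting."""
    if "open_lm_1b" in checkpoint_name and args.positional_embedding_type != "head_rotary":
        raise ValueError(
            f"{checkpoint_name} was trained with head_rotary positional embeddings; "
            f"got {args.positional_embedding_type!r}"
        )


args = Namespace(positional_embedding_type="head_rotary")  # assumed argument name
check_positional_embedding(args, "open_lm_1b.pt")  # passes silently
```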

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

13 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

Explore Similar Projects

recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

created 5 months ago · updated 2 weeks ago
806 stars · Top 0.1% on sourcepulse

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

LLM research codebase for training and inference

created 9 months ago · updated 2 weeks ago
5k stars · Top 0.1% on sourcepulse