open_lm by mlfoundations

Language model research repo for medium-sized models (up to 7B params)

Created 2 years ago
509 stars

Top 61.4% on SourcePulse

View on GitHub
Project Summary

OpenLM is a PyTorch-based language modeling repository designed for efficient research on medium-sized models (up to 7B parameters). It offers a minimal dependency set, focusing on PyTorch, XFormers, and Triton, making it accessible for researchers and practitioners looking to train or fine-tune LMs without the overhead of larger frameworks.

How It Works

OpenLM utilizes a modular design centered around PyTorch, allowing for flexible integration of performance-enhancing libraries like XFormers and Triton. The training pipeline supports distributed computation via torchrun and handles data preprocessing and loading through the webdataset package. This approach prioritizes core LM functionality and performance, enabling researchers to experiment with various model sizes and training configurations efficiently.
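
The sketch below shows, in minimal form, the kind of webdataset pipeline this implies: streaming fixed-length token sequences from tar shards into batches for the training loop. The shard pattern and the "json" payload key are illustrative assumptions, not the repo's actual data layout.

    # Minimal webdataset loading sketch. Assumptions: shard naming, and a
    # .json member per sample holding a list of 2,048 token ids.
    import torch
    import webdataset as wds

    SHARDS = "data/shard-{000000..000099}.tar"  # hypothetical shard pattern

    def to_tensor(sample):
        # Each decoded .json member is assumed to be a list of token ids.
        return torch.tensor(sample["json"], dtype=torch.long)

    dataset = wds.WebDataset(SHARDS).decode().map(to_tensor)
    loader = wds.WebLoader(dataset, batch_size=8, num_workers=4)

    for batch in loader:
        print(batch.shape)  # torch.Size([8, 2048]) -- one LM training batch
        break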

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install --editable .
  • Prerequisites: Python >= 3.9, PyTorch, XFormers, Triton.
  • Data Preprocessing: Requires downloading and tokenizing data using provided scripts (wiki_download.py, make_2048.py); a conceptual sketch of the packing step follows this list.
  • Training: Uses torchrun for distributed training; an example command is provided in the README.
  • Evaluation: Requires llm-foundry (pip install llm-foundry).
  • Links: Quickstart, Pretrained Models
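
As referenced in the data-preprocessing bullet, the core of that step is tokenizing raw text and packing it into fixed-length 2,048-token sequences. A conceptual sketch, assuming a Hugging Face tokenizer; the actual scripts live in the repo and may differ:

    # What make_2048.py does in spirit: tokenize documents, join them with
    # EOS, and cut the stream into 2,048-token sequences.
    from transformers import AutoTokenizer

    SEQ_LEN = 2048
    # Assumed tokenizer for illustration; check the repo for the one it uses.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    def pack(texts):
        ids = []
        for text in texts:
            ids.extend(tokenizer.encode(text))
            ids.append(tokenizer.eos_token_id)  # document separator
        n_full = len(ids) // SEQ_LEN  # drop the trailing partial chunk
        return [ids[i * SEQ_LEN:(i + 1) * SEQ_LEN] for i in range(n_full)]

    chunks = pack(["First document ...", "Second document ..."])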

Highlighted Details

  • Scales to model sizes up to 7B parameters and training runs across 250+ GPUs.
  • Includes pretrained OpenLM-1B and OpenLM-7B models.
  • Offers detailed performance benchmarks against other models like LLaMA and MPT.
  • Provides scripts for data preprocessing, training, evaluation, and text generation.
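
On the last point, the core of any such generation script is a decoding loop. A generic greedy-decoding sketch, where model stands in for any causal LM returning (batch, seq, vocab) logits; this is not open_lm's actual API:

    # Generic greedy decoding; `model` is a stand-in causal LM, not
    # open_lm's interface.
    import torch

    @torch.no_grad()
    def generate(model, prompt_ids, max_new_tokens=50):
        ids = prompt_ids.clone()                    # (1, prompt_len)
        for _ in range(max_new_tokens):
            logits = model(ids)                     # (1, seq_len, vocab)
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)  # append the argmax token
        return ids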

Maintenance & Community

Developed by researchers from multiple institutions including RAIVN Lab (University of Washington), UWNLP, Toyota Research Institute, and Columbia University. Code is based on open-clip and open-flamingo. Stability.ai provided resource support.

Licensing & Compatibility

The repository does not explicitly state a license in the README; licensing should be clarified before commercial use or closed-source linking.

Limitations & Caveats

The README notes that the OpenLM-7B model is still in training; a checkpoint trained on 1.25T tokens has been released. It also notes that the pretrained OpenLM-1B model uses an older positional-embedding configuration, requiring the head_rotary setting for compatibility with older training configurations.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Aravind Srinivas (Cofounder of Perplexity), François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize), and 42 more.

Explore Similar Projects

spaCy by explosion

NLP library for production applications

Created 11 years ago
Updated 3 months ago
32k stars

Top 0.1% on SourcePulse