open_lm by mlfoundations

Language model research repo for medium-sized models (up to 7B params)

created 1 year ago
506 stars

Top 62.4% on sourcepulse

Project Summary

OpenLM is a PyTorch-based language modeling repository designed for efficient research on medium-sized models (up to 7B parameters). It offers a minimal dependency set, focusing on PyTorch, XFormers, and Triton, making it accessible for researchers and practitioners looking to train or fine-tune LMs without the overhead of larger frameworks.

How It Works

OpenLM utilizes a modular design centered around PyTorch, allowing for flexible integration of performance-enhancing libraries like XFormers and Triton. The training pipeline supports distributed computation via torchrun and handles data preprocessing and loading through the webdataset package. This approach prioritizes core LM functionality and performance, enabling researchers to experiment with various model sizes and training configurations efficiently.
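
As a rough sketch of the webdataset-based loading path, the snippet below streams pre-tokenized shards into fixed-length batches. The shard URL pattern and the assumption that each sample stores equal-length token sequences under a json key are illustrative only, not the repository's actual shard layout.

```python
# Minimal sketch of streaming pre-tokenized shards with webdataset.
# The shard pattern and the "json" key holding equal-length token sequences
# are assumptions for illustration; open_lm's actual shard layout may differ.
import torch
import webdataset as wds

SHARDS = "data/shards/shard-{000000..000099}.tar"  # hypothetical shard pattern


def to_tensor(sample):
    # sample is a 1-tuple containing the decoded JSON payload (a list of token IDs)
    return torch.tensor(sample[0], dtype=torch.long)


dataset = (
    wds.WebDataset(SHARDS)
    .shuffle(1000)        # shuffle within an in-memory buffer
    .decode()             # decode .json entries into Python objects
    .to_tuple("json")     # keep only the token payload
    .map(to_tensor)
)

loader = wds.WebLoader(dataset, batch_size=8, num_workers=4)

for batch in loader:
    # (8, sequence_length) tensor of token IDs, ready for a language-model step
    print(batch.shape)
    break
```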

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install --editable .
  • Prerequisites: Python >= 3.9, PyTorch, XFormers, Triton.
  • Data Preprocessing: Requires downloading and tokenizing data using the provided scripts (wiki_download.py, make_2048.py); a minimal packing sketch follows this list.
  • Training: Uses torchrun for distributed training; an example command is provided in the README.
  • Evaluation: Requires llm-foundry (pip install llm-foundry).
  • Links: Quickstart, Pretrained Models
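
The preprocessing step referenced above amounts to tokenizing raw text and packing it into fixed-length 2048-token sequences (what make_2048.py produces). The sketch below shows the idea; the tokenizer choice (EleutherAI/gpt-neox-20b) and the EOS-separation scheme are assumptions, not a drop-in replacement for the repository's scripts.

```python
# Minimal sketch: tokenize raw text and pack it into 2048-token sequences.
# The tokenizer and EOS handling are assumptions for illustration.
from transformers import AutoTokenizer

SEQ_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # assumed tokenizer


def pack_documents(documents):
    """Concatenate tokenized documents and yield SEQ_LEN-sized chunks."""
    buffer = []
    for text in documents:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # separate documents with EOS
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]
            buffer = buffer[SEQ_LEN:]


docs = ["First example document.", "Second example document."]
chunks = list(pack_documents(docs))
print(len(chunks))  # 0 for such short inputs; real corpora yield many chunks
```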

Highlighted Details

  • Supports model sizes up to 7B parameters and training runs on 250+ GPUs.
  • Includes pretrained OpenLM-1B and OpenLM-7B models.
  • Offers detailed performance benchmarks against other models like LLaMA and MPT.
  • Provides scripts for data preprocessing, training, evaluation, and text generation.

Maintenance & Community

Developed by researchers from multiple institutions including RAIVN Lab (University of Washington), UWNLP, Toyota Research Institute, and Columbia University. Code is based on open-clip and open-flamingo. Stability.ai provided resource support.

Licensing & Compatibility

The README does not explicitly state a license, so licensing should be clarified before commercial use or closed-source linking.

Limitations & Caveats

The README notes that the OpenLM-7B model is still in training, with a checkpoint released at 1.25T tokens. It also notes that the pretrained OpenLM-1B model uses the head_rotary positional embedding type, which must be set for compatibility with older training configurations.
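
To make the head_rotary caveat concrete, the sketch below shows a fail-fast check one might add before loading an OpenLM-1B checkpoint. The argument name positional_embedding_type is hypothetical and should be matched against the actual open_lm training configuration.

```python
# Hypothetical sketch only: the argument name below is an assumption, not
# open_lm's actual flag. It illustrates the caveat that the pretrained
# OpenLM-1B checkpoint expects head_rotary positional embeddings, so a
# mismatched setting should be caught before loading weights.
from argparse import Namespace


def check_positional_embedding(args: Namespace, checkpoint_name: str) -> None:
    """Fail fast if an OpenLM-1B checkpoint is paired with the wrong setting."""
    if "open_lm_1b" in checkpoint_name and args.positional_embedding_type != "head_rotary":
        raise ValueError(
            f"{checkpoint_name} was trained with head_rotary positional embeddings; "
            f"got {args.positional_embedding_type!r}"
        )


args = Namespace(positional_embedding_type="head_rotary")  # assumed argument name
check_positional_embedding(args, "open_lm_1b.pt")  # passes silently
```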

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

13 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

Explore Similar Projects

recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

created 5 months ago · updated 2 weeks ago
806 stars · Top 0.1% on sourcepulse

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

LLM research codebase for training and inference

created 9 months ago · updated 2 weeks ago
5k stars · Top 0.1% on sourcepulse