olmes by allenai

LLM evaluation system for reproducible research

Created 1 year ago
261 stars

Top 97.4% on SourcePulse

1 Expert Loves This Project
Project Summary

The Open Language Model Evaluation System (OLMES) provides a flexible, reproducible framework for evaluating large language models (LLMs) across diverse tasks. Aimed at researchers and engineers, it enables faithful reproduction of LLM evaluation results from key papers and deepens analysis through detailed logging and customizable configurations.

How It Works

Built on EleutherAI's lm-evaluation-harness, OLMES adds deep configuration of task variants and detailed instance-level logging (e.g., logprobs). It supports custom metrics, aggregation strategies, and flexible external data storage integrations, enabling more thorough analysis of LLM performance.
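As a sketch of what instance-level logging enables, per-instance prediction records can be re-aggregated with a custom metric after the fact. The JSONL field names below are illustrative assumptions, not the exact OLMES output schema:

```python
import json

# Hypothetical per-instance log lines in an OLMES-style JSONL format.
# Field names ("doc_id", "label", "prediction", "sum_logits") are assumed.
sample_lines = [
    '{"doc_id": 0, "label": "B", "prediction": "B", "sum_logits": -1.2}',
    '{"doc_id": 1, "label": "A", "prediction": "C", "sum_logits": -3.4}',
]

def accuracy_from_jsonl(lines):
    """Recompute accuracy from per-instance prediction records."""
    records = [json.loads(line) for line in lines]
    correct = sum(r["prediction"] == r["label"] for r in records)
    return correct / len(records)

print(accuracy_from_jsonl(sample_lines))  # 0.5
```

Because each instance is logged individually, any aggregation (per-category accuracy, calibration from logprobs, etc.) can be computed without re-running the model.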

Quick Start & Requirements

  • Installation: Clone the repo, create a Python 3.10+ Conda environment, and run pip install -e . from the repo root. For GPU support, run pip install -e .[gpu] (requires vLLM).
  • Prerequisites: Python 3.10+ and torch>=2.2 (downgrading PyTorch may be necessary in some environments).
  • Guidance: CLI commands such as oe-eval --help provide in-tool documentation.
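The steps above can be sketched as a setup script; the repository URL and environment name are assumptions, so verify them against the project's README:

```shell
# Sketch of the installation steps; repo URL and env name are assumed.
git clone https://github.com/allenai/olmes.git
cd olmes
conda create -n olmes python=3.10 -y
conda activate olmes
pip install -e .          # core install
pip install -e ".[gpu]"   # optional: GPU support (requires vLLM)
oe-eval --help            # in-tool documentation
```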

Highlighted Details

  • Reproduces results from OLMo, OLMES, TÜLU 3, OLMo 2 papers.
  • Supports deep task configurations and detailed instance-level logging (logprobs).
  • Integrates custom metrics, aggregations, and flexible output storage (Google Sheets, Hugging Face Datasets, S3, Weights & Biases).
  • Supports Hugging Face, vLLM, and LiteLLM model backends.
  • CLI tools for setup inspection (--inspect) and command preview (--dry-run).
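A hedged sketch of how those CLI helpers might be invoked; the model and task identifiers are illustrative assumptions, so check oe-eval --help for the actual syntax:

```shell
# Illustrative invocations; model/task names are assumptions.
oe-eval --model olmo-1b --task arc_challenge::olmes --inspect   # inspect setup on a few instances
oe-eval --model olmo-1b --task arc_challenge::olmes --dry-run   # preview the command without running it
```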

Maintenance & Community

The project is backed by the Allen Institute for AI (AI2) and its Open Language Model efforts. Specific community channels or contributor details are not provided in the README snippet.

Licensing & Compatibility

The license is not specified in the provided README content, potentially impacting commercial use or closed-source integration.

Limitations & Caveats

No explicit limitations, known bugs, or project status (alpha/beta) are listed. Dependency management may require care (e.g., the PyTorch version constraint), and the absence of license information is a key adoption caveat.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

0.8% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago · Updated 3 weeks ago