olmes by allenai

LLM evaluation system for reproducible research

Created 1 year ago
318 stars

Top 85.2% on SourcePulse

Project Summary

Open Language Model Evaluation System (OLMES) provides a flexible, reproducible system for evaluating large language models (LLMs) across diverse tasks. Aimed at researchers and engineers, it enables faithful reproduction of LLM evaluation results from key papers and supports deeper analysis through detailed logging and customizable configurations.

How It Works

Building on EleutherAI's lm-evaluation-harness, OLMES adds deep configuration of task variants and detailed instance-level logging (e.g., logprobs). It supports custom metrics, aggregation strategies, and flexible integrations with external data storage, enabling more thorough analysis of LLM performance.
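A representative invocation might pair a base task with an OLMES configuration suffix and write per-instance records alongside aggregate metrics. This is a sketch only: the model and task identifiers below are illustrative, and oe-eval --help lists the actual options.

    # Evaluate a model on an OLMES-configured task variant; per-instance
    # predictions and logprobs are written to the output directory
    oe-eval --model allenai/OLMo-7B --task arc_challenge::olmes --output-dir ./eval-out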

Quick Start & Requirements

  • Installation: Clone the repo, create a Python 3.10+ Conda environment, and run pip install -e . (GPU support via pip install -e .[gpu], which requires vLLM); see the sketch after this list.
  • Prerequisites: Python 3.10+ and torch>=2.2 (a torch downgrade may be needed in some environments).
  • Guidance: CLI commands such as oe-eval --help provide in-tool documentation.
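A minimal setup sketch following the steps above (the repository URL and environment name are inferred from the project name, not taken from the README):

    # Clone the repository and install into a fresh Conda environment
    git clone https://github.com/allenai/olmes.git
    cd olmes
    conda create -n olmes python=3.10 -y
    conda activate olmes
    pip install -e .            # base install
    pip install -e ".[gpu]"     # optional: GPU support (vLLM)
    oe-eval --help              # confirm the CLI is available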

Highlighted Details

  • Reproduces results from the OLMo, OLMES, TÜLU 3, and OLMo 2 papers.
  • Supports deep task configurations and detailed instance-level logging (logprobs).
  • Integrates custom metrics, aggregations, and flexible output storage (Google Sheets, Hugging Face Datasets, S3, W&B).
  • Supports Hugging Face, vLLM, and LiteLLM model backends.
  • CLI tools for setup inspection (--inspect) and command preview (--dry-run); see the sketch after this list.
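A sketch of the inspection workflow using the two flags above (task and model names remain illustrative):

    # Preview a task's configuration and sample formatted instances
    oe-eval --task arc_challenge::olmes --inspect

    # Show what would run, without executing the evaluation
    oe-eval --model allenai/OLMo-7B --task arc_challenge::olmes --dry-run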

Maintenance & Community

The project is backed by the Allen Institute for AI (AI2) as part of its Open Language Model (OLMo) efforts. Specific community channels or contributor details are not provided in the README snippet.

Licensing & Compatibility

The license is not specified in the provided README content, potentially impacting commercial use or closed-source integration.

Limitations & Caveats

No explicit limitations, known bugs, or maturity status (alpha/beta) are listed. Dependency management may need care (e.g., the PyTorch version), and the absence of license information is a key adoption caveat.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 18 stars in the last 30 days

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 8 more.

Explore Similar Projects

lighteval by huggingface
0.5% · 2k stars · LLM evaluation toolkit for multiple backends
Created 1 year ago · Updated 3 days ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench
0.8% · 4k stars · Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago · Updated 1 week ago