olmes by allenai

LLM evaluation system for reproducible research

Created 1 year ago
318 stars

Top 85.2% on SourcePulse

Project Summary

Open Language Model Evaluation System (OLMES) provides a flexible, reproducible system for evaluating large language models (LLMs) across diverse tasks. Aimed at researchers and engineers, it enables faithful reproduction of LLM evaluation results from key papers and supports deeper analysis through detailed logging and customizable configurations.

How It Works

Building on EleutherAI's lm-evaluation-harness, OLMES adds deep configuration of task variants and detailed instance-level logging (e.g., logprobs). It supports custom metrics, aggregation strategies, and flexible integrations with external data storage, enabling more thorough analysis of LLM performance.
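A representative invocation might pair a base task with an OLMES configuration suffix and write per-instance records alongside aggregate metrics. This is a sketch only: the model and task identifiers below are illustrative, and oe-eval --help lists the actual options.

    # Evaluate a model on an OLMES-configured task variant; per-instance
    # predictions and logprobs are written to the output directory
    oe-eval --model allenai/OLMo-7B --task arc_challenge::olmes --output-dir ./eval-out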

Quick Start & Requirements

  • Installation: Clone the repo, create a Python 3.10+ Conda environment, and run pip install -e . (GPU support via pip install -e .[gpu], which requires vLLM); see the sketch after this list.
  • Prerequisites: Python 3.10+ and torch>=2.2 (a torch downgrade may be needed in some environments).
  • Guidance: CLI commands such as oe-eval --help provide in-tool documentation.
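A minimal setup sketch following the steps above (the repository URL and environment name are inferred from the project name, not taken from the README):

    # Clone the repository and install into a fresh Conda environment
    git clone https://github.com/allenai/olmes.git
    cd olmes
    conda create -n olmes python=3.10 -y
    conda activate olmes
    pip install -e .            # base install
    pip install -e ".[gpu]"     # optional: GPU support (vLLM)
    oe-eval --help              # confirm the CLI is available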

Highlighted Details

  • Reproduces results from the OLMo, OLMES, TÜLU 3, and OLMo 2 papers.
  • Supports deep task configurations and detailed instance-level logging (logprobs).
  • Integrates custom metrics, aggregations, and flexible output storage (Google Sheets, Hugging Face Datasets, S3, W&B).
  • Supports Hugging Face, vLLM, and LiteLLM model backends.
  • CLI tools for setup inspection (--inspect) and command preview (--dry-run); see the sketch after this list.
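A sketch of the inspection workflow using the two flags above (task and model names remain illustrative):

    # Preview a task's configuration and sample formatted instances
    oe-eval --task arc_challenge::olmes --inspect

    # Show what would run, without executing the evaluation
    oe-eval --model allenai/OLMo-7B --task arc_challenge::olmes --dry-run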

Maintenance & Community

The project is backed by the Allen Institute for AI (AI2) as part of its Open Language Model (OLMo) efforts. Specific community channels or contributor details are not provided in the README snippet.

Licensing & Compatibility

The license is not specified in the provided README content, potentially impacting commercial use or closed-source integration.

Limitations & Caveats

No explicit limitations, known bugs, or maturity status (alpha/beta) are listed. Dependency management may need care (e.g., the PyTorch version), and the absence of license information is a key adoption caveat.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 18 stars in the last 30 days

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 8 more.

Explore Similar Projects

lighteval by huggingface
0.5% · 2k stars · LLM evaluation toolkit for multiple backends
Created 1 year ago · Updated 3 days ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench
0.8% · 4k stars · Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago · Updated 1 week ago