OLMo-Eval-Legacy by allenai

Evaluation suite for LLMs

Created 2 years ago

377 stars

Top 75.5% on SourcePulse

1 Expert Loves This Project

Luca Soldaini

Research Scientist at Ai2

Project Summary

This repository provides an evaluation framework for open language models on NLP tasks, designed for researchers and developers. It allows users to run comprehensive evaluation pipelines, compute aggregate metrics across multiple tasks, and report results, though it is now superseded by the OLMES repository.
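As an illustration of what computing aggregate metrics across multiple tasks can look like, here is a minimal sketch that macro-averages per-task scores for each model. The function names, the model and task labels, and the scores are all hypothetical placeholders; this is not the repository's own aggregation code, and the framework may weight or combine tasks differently.

```python
from statistics import mean
from typing import Dict


def macro_average(per_task: Dict[str, float]) -> float:
    """Unweighted mean of per-task scores (every task counts equally)."""
    if not per_task:
        raise ValueError("no task scores to aggregate")
    return mean(list(per_task.values()))


def aggregate(results: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Map each model name to its macro-averaged score over all tasks."""
    return {model: macro_average(scores) for model, scores in results.items()}


# Placeholder scores for illustration only, not real evaluation results.
example = {
    "model-a": {"task-x": 0.5, "task-y": 0.75},
    "model-b": {"task-x": 0.25, "task-y": 0.25},
}
summary = aggregate(example)
```

A macro-average treats every task equally regardless of how many examples it contains; weighting by example count would be an alternative design.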

How It Works

The framework uses ai2-tango and ai2-catwalk to define and execute evaluation pipelines. Users specify models and task sets (collections of NLP tasks) in configuration files. The system then runs a series of steps that generate model outputs and compute metrics, and can optionally report results to Google Sheets. Because the outputs of earlier steps are reused when a configuration is rerun, several models can be evaluated across many tasks without repeating work that has already been done.
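The idea of reusing previous outputs can be sketched in a few lines. The example below is a simplified illustration, not ai2-tango's actual mechanism: it keys each (model, task, parameters) step by a hash of its configuration and skips recomputation when the key has been seen before. The names `step_key`, `run_step`, and `fake_eval` are hypothetical, the cache is in-memory rather than a persistent workspace, and the returned accuracy is a placeholder.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory cache; a real workspace would persist outputs to disk.
_cache: Dict[str, dict] = {}


def step_key(model: str, task: str, params: dict) -> str:
    """Derive a deterministic key from everything that affects the step's output."""
    payload = json.dumps(
        {"model": model, "task": task, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(
    model: str,
    task: str,
    params: dict,
    compute: Callable[[str, str, dict], dict],
) -> dict:
    """Return the cached output if this exact configuration ran before."""
    key = step_key(model, task, params)
    if key not in _cache:
        _cache[key] = compute(model, task, params)
    return _cache[key]


calls: List[Tuple[str, str]] = []


def fake_eval(model: str, task: str, params: dict) -> dict:
    """Stand-in for an expensive evaluation; records each real invocation."""
    calls.append((model, task))
    return {"model": model, "task": task, "accuracy": 0.5}  # placeholder value


first = run_step("model-a", "task-x", {"num_shots": 0}, fake_eval)
second = run_step("model-a", "task-x", {"num_shots": 0}, fake_eval)  # cache hit
third = run_step("model-a", "task-x", {"num_shots": 5}, fake_eval)   # new config
```

Here the second call returns the stored result without invoking `fake_eval` again, while changing any parameter produces a new key and triggers a fresh computation.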

Quick Start & Requirements

Install (run from the directory that contains your local clone of the repository):

conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
cd OLMo-Eval
pip install -e .

Prerequisites: Python 3.10, Conda.

Example Run:

tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace

Task set configurations: configs/task_sets

Highlighted Details

Evaluates common models like Falcon-7b, MPT-7b, Llama2-7b, and Llama2-13b on benchmarks including MMLU.
Supports incremental computation, reusing previous outputs when configurations are rerun.
Offers optional Google Sheets integration for reporting.
Used for PALOMA paper evaluations.

Maintenance & Community

This repository has been superseded by the OLMES repository (https://github.com/allenai/olmes).

Licensing & Compatibility

The README does not explicitly state a license.

Limitations & Caveats

This repository is deprecated in favor of OLMES, so further development or support for OLMo-Eval itself is unlikely.

Health Check

Last Commit

6 months ago

Responsiveness

Inactive

Star History

6 stars in the last 30 days