This repository provides an evaluation framework for open language models on NLP tasks, designed for researchers and developers. It allows users to run comprehensive evaluation pipelines, compute aggregate metrics across multiple tasks, and report results, though it is now superseded by the OLMES repository.
How It Works
The framework uses ai2-tango and ai2-catwalk to define and execute evaluation pipelines. Users specify models and task sets (collections of NLP tasks) in configuration files; the system then runs a series of steps that generate model outputs and compute metrics, with optional reporting to Google Sheets. This design supports efficient evaluation of multiple models across many tasks and enables incremental computation by reusing previously computed step outputs.
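To make the configuration flow concrete, below is a minimal, hypothetical sketch of a Tango pipeline config. The step types, model id, and task names shown here are illustrative assumptions rather than the repository's actual API; configs/example_config.jsonnet in the repository contains the real step definitions.

{
  steps: {
    // Each Tango step declares a "type" plus its parameters; step outputs are
    // cached in the workspace and can be reused by later runs.
    model_outputs: {
      type: "predict-and-score",          // hypothetical step type
      model: "EleutherAI/pythia-1b",      // assumed Hugging Face model id
      task_set: ["arc_easy", "boolq"],    // assumed task names
    },
    aggregate_metrics: {
      type: "aggregate-task-metrics",     // hypothetical step type
      // Reference another step's output by name (Tango's "ref" convention).
      outputs: { type: "ref", ref: "model_outputs" },
    },
  },
}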
Quick Start & Requirements
# create a conda environment and install the package in editable mode
conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
git clone https://github.com/allenai/OLMo-Eval.git
cd OLMo-Eval
pip install -e .

# run the example pipeline, caching step outputs in a local Tango workspace
tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace
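Because step outputs are cached in the named workspace, re-running against the same workspace recomputes only steps whose inputs have changed. A sketch of this, with the second config name purely illustrative:

# steps shared with an earlier run are loaded from the workspace cache
tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace
# a different config that shares steps with the first run reuses their cached outputs
tango --settings tango.yml run configs/another_config.jsonnet --workspace my-eval-workspace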
Highlighted Details
Maintenance & Community
This repository has been superseded by the OLMES repository (https://github.com/allenai/olmes).
Licensing & Compatibility
The README does not explicitly state a license.
Limitations & Caveats
The repository is deprecated in favor of OLMES, so OLMo-Eval itself is unlikely to receive further development or support.