OLMo-Eval by allenai

Evaluation suite for LLMs

created 1 year ago
355 stars

Top 79.7% on sourcepulse

Project Summary

This repository provides an evaluation framework for open language models on NLP tasks, aimed at researchers and developers. Users can run evaluation pipelines, compute aggregate metrics across multiple tasks, and report results. It has since been superseded by the OLMES repository.

How It Works

The framework uses ai2-tango and ai2-catwalk to define and execute evaluation pipelines. Users specify models and task sets (collections of NLP tasks) in configuration files. The system then runs a series of steps that generate model outputs and compute metrics, and can optionally report results to Google Sheets. Because the outputs of earlier steps are reused when a configuration is rerun, several models can be evaluated across many tasks without recomputing work that is already done.
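The reuse of previous outputs can be illustrated with a minimal sketch. This is not the ai2-tango API: `step_key`, `run_cached`, `fake_eval`, and the JSON-file cache layout are hypothetical names chosen for illustration, and the accuracy value is a placeholder rather than a real result. The idea shown is only that a step's output is stored under a key derived from its configuration, so an unchanged configuration is served from the workspace instead of being recomputed.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def step_key(name, config):
    """Derive a stable key from the step name and its configuration."""
    payload = json.dumps({"step": name, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_cached(workspace, name, config, compute):
    """Run `compute(config)` unless a result for this exact config exists.

    Returns (result, was_cached).
    """
    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    path = workspace / f"{name}-{step_key(name, config)}.json"
    if path.exists():
        # Same step and same config as an earlier run: reuse the stored output.
        return json.loads(path.read_text()), True
    result = compute(config)
    path.write_text(json.dumps(result))
    return result, False


calls = []  # records how often the (hypothetical) evaluation actually runs


def fake_eval(config):
    """Stand-in for a real evaluation step; the score is a placeholder."""
    calls.append(config["model"])
    return {"model": config["model"], "task": config["task"], "accuracy": 0.5}


workspace = tempfile.mkdtemp()
cfg = {"model": "model-a", "task": "task-x"}
first, first_cached = run_cached(workspace, "evaluate", cfg, fake_eval)
second, second_cached = run_cached(workspace, "evaluate", cfg, fake_eval)
```

In this sketch the second call returns the stored result without invoking `fake_eval` again. Changing any field in `cfg` would change the key and trigger a fresh computation.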

Quick Start & Requirements

  • Install:
    conda create -n eval-pipeline python=3.10
    conda activate eval-pipeline
    cd OLMo-Eval
    pip install -e .
    
  • Prerequisites: Python 3.10, Conda.
  • Example Run:
    tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace
    
  • Documentation: configs/task_sets

Highlighted Details

  • Evaluates common models like Falcon-7b, MPT-7b, Llama2-7b, and Llama2-13b on benchmarks including MMLU.
  • Supports incremental computation, reusing previous outputs when configurations are rerun.
  • Offers optional Google Sheets integration for reporting.
  • Used for PALOMA paper evaluations.
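Evaluating several models on a task set usually ends with one aggregate score per model. The sketch below shows one plausible aggregation, an unweighted mean over tasks. The record layout, the `aggregate` function, the model and task names, and all scores are hypothetical placeholders; the source does not say which aggregation OLMo-Eval applies.

```python
from statistics import mean

# Placeholder per-task results; these are not real benchmark numbers.
results = [
    {"model": "model-a", "task": "task-x", "accuracy": 0.50},
    {"model": "model-a", "task": "task-y", "accuracy": 0.75},
    {"model": "model-b", "task": "task-x", "accuracy": 0.25},
    {"model": "model-b", "task": "task-y", "accuracy": 0.75},
]


def aggregate(rows, metric="accuracy"):
    """Return the unweighted mean of `metric` over tasks, per model."""
    per_model = {}
    for row in rows:
        per_model.setdefault(row["model"], []).append(row[metric])
    return {model: mean(scores) for model, scores in per_model.items()}


summary = aggregate(results)
for model, score in sorted(summary.items()):
    print(f"{model}: {score:.3f}")
```

An unweighted mean treats every task equally regardless of its size. Weighting by the number of examples per task would be an equally reasonable choice under different assumptions.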

Maintenance & Community

This repository has been superseded by the OLMES repository (https://github.com/allenai/olmes).

Licensing & Compatibility

The license is not explicitly stated in the provided README snippet.

Limitations & Caveats

This repository is deprecated in favor of OLMES, so OLMo-Eval itself is unlikely to receive further development or support.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days