OLMo-Eval by allenai

Evaluation suite for LLMs

Created 1 year ago
361 stars

Top 77.6% on SourcePulse

Project Summary

This repository provides an evaluation framework for open language models on NLP tasks, designed for researchers and developers. It allows users to run comprehensive evaluation pipelines, compute aggregate metrics across multiple tasks, and report results, though it is now superseded by the OLMES repository.

How It Works

The framework uses ai2-tango and ai2-catwalk to define and execute evaluation pipelines. Users specify models and task sets (collections of NLP tasks) in configuration files. The system then runs a series of steps that generate model outputs and compute metrics, and it can optionally report results to Google Sheets. Because step outputs are stored in a workspace, rerunning a configuration reuses previous outputs instead of recomputing them, which keeps evaluation of multiple models across many tasks efficient.
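
The incremental computation described above can be illustrated with a minimal sketch. This is not the ai2-tango implementation or API; the names `cache_key`, `run_step`, and `fake_predict`, the JSON-file workspace layout, and the placeholder score are all assumptions made for illustration. The idea shown is only that a step's output is keyed by its name and configuration, so an identical rerun reads the stored result instead of recomputing it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(step_name: str, config: dict) -> str:
    """Derive a stable key from the step name and its configuration."""
    payload = json.dumps({"step": step_name, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(workspace: Path, step_name: str, config: dict, compute):
    """Return (result, was_cached); `compute` runs only on a cache miss."""
    path = workspace / f"{cache_key(step_name, config)}.json"
    if path.exists():
        # Same step and config seen before: reuse the stored output.
        return json.loads(path.read_text()), True
    result = compute(config)
    path.write_text(json.dumps(result))
    return result, False


calls = []  # records how often the expensive step actually runs


def fake_predict(config: dict) -> dict:
    """Hypothetical stand-in for an expensive model-evaluation step."""
    calls.append(config["task"])
    # Placeholder value, not a real evaluation result.
    return {"model": config["model"], "task": config["task"], "accuracy": 0.5}


workspace = Path(tempfile.mkdtemp())
cfg = {"model": "toy-model", "task": "toy-task"}
first, first_cached = run_step(workspace, "predict", cfg, fake_predict)
second, second_cached = run_step(workspace, "predict", cfg, fake_predict)
```

In this sketch, changing any value in `cfg` changes the key, so only the affected step would be recomputed on the next run.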

Quick Start & Requirements

  • Install:
    conda create -n eval-pipeline python=3.10
    conda activate eval-pipeline
    cd OLMo-Eval
    pip install -e .
    
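  • Prerequisites: Python 3.10, Conda, and a local clone of the repository (the install commands run from inside the OLMo-Eval directory).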
  • Example Run:
    tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace
    
  • Documentation: see configs/task_sets in the repository for the task set definitions.

Highlighted Details

  • Evaluates common models like Falcon-7b, MPT-7b, Llama2-7b, and Llama2-13b on benchmarks including MMLU.
  • Supports incremental computation, reusing previous outputs when configurations are rerun.
  • Offers optional Google Sheets integration for reporting.
  • Used for PALOMA paper evaluations.
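
Evaluating several models on several tasks, as listed above, leads naturally to the aggregate metrics mentioned in the summary. The sketch below shows one plausible aggregation, an unweighted (macro) average of per-task scores for each model. It is an assumption, not the repository's actual aggregation logic: the model names, task names, and score values are invented placeholders, and `macro_average` is a hypothetical helper.

```python
from statistics import mean

# Placeholder scores: hypothetical models and tasks, values invented for illustration.
scores = {
    "model-a": {"task-1": 0.60, "task-2": 0.40},
    "model-b": {"task-1": 0.70, "task-2": 0.50},
}


def macro_average(per_task: dict[str, float]) -> float:
    """Unweighted mean over tasks, so every task counts equally."""
    return mean(per_task.values())


# One aggregate number per model, computed over all of its tasks.
aggregate = {model: macro_average(tasks) for model, tasks in scores.items()}

for model, value in sorted(aggregate.items()):
    print(f"{model}\t{value:.3f}")
```

A macro average treats a small task the same as a large one; weighting by the number of examples per task would be an alternative design choice.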

Maintenance & Community

This repository has been superseded by the OLMES repository (https://github.com/allenai/olmes).

Licensing & Compatibility

The license is not explicitly stated in the provided README snippet.

Limitations & Caveats

This repository is deprecated in favor of OLMES, so further development or support for OLMo-Eval itself is unlikely. New evaluation work should start from the OLMES repository.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days
