Evaluation code for instruction-tuned LLMs
Top 59.3% on sourcepulse
InstructEval provides a framework for quantitatively evaluating instruction-tuned large language models (LLMs) on held-out tasks. It aims to simplify and standardize benchmarking across models and datasets, making it easier to compare instruction-tuned models such as Alpaca and Flan-T5 against larger, more costly LLMs.
How It Works
The project leverages HuggingFace Transformers to support a wide range of models, including causal and sequence-to-sequence architectures. It implements evaluation protocols for established benchmarks such as MMLU, BBH, DROP, and HumanEval, using specific prompting strategies (e.g., 5-shot for MMLU, 0-shot for HumanEval) and metrics (exact-match, pass@1). This approach facilitates direct comparison of instruction-tuned models against each other and against larger models like GPT-4.
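To make the protocol concrete, the following is a minimal sketch of a few-shot, exact-match style evaluation of a sequence-to-sequence model with HuggingFace Transformers. It is illustrative only: the model name, the toy questions, and the simplified prompt construction are assumptions, not the project's own code.

# Illustrative sketch: few-shot, exact-match evaluation of a seq-to-seq model
# (e.g. Flan-T5) with HuggingFace Transformers. The prompt building below is a
# simplified stand-in for the project's actual prompting code.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"  # small model chosen only for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Solved examples followed by the test question (toy data).
few_shot = (
    "Question: 2 + 2 = ?\nAnswer: 4\n\n"
    "Question: The capital of France is?\nAnswer: Paris\n\n"
    # ...three more solved examples would complete a 5-shot prompt...
)
test_question = "Question: 3 * 3 = ?\nAnswer:"
gold_answer = "9"

inputs = tokenizer(few_shot + test_question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Exact-match scoring: 1 if the prediction equals the reference, else 0.
print("exact_match:", int(prediction == gold_answer))

HumanEval's pass@1 follows the same overall shape, except the model generates code and scoring checks whether that code passes the task's unit tests rather than exact string equality.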
Quick Start & Requirements
conda create -n instruct-eval python=3.8 -y
conda activate instruct-eval
pip install -r requirements.txt

mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu

Python dependencies are listed in requirements.txt.
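After the download, a quick sanity check can confirm the benchmark data is in place. The sketch below assumes the Hendrycks MMLU tarball extracts into per-split directories of headerless CSV files under data/mmlu (dev/, val/, test/); adjust the paths if the layout differs.

# Count subject files and questions per MMLU split under data/mmlu.
from pathlib import Path
import csv

data_dir = Path("data/mmlu")
for split in ("dev", "val", "test"):
    files = sorted((data_dir / split).glob("*.csv"))
    rows = 0
    for path in files:
        with open(path, newline="", encoding="utf-8") as f:
            rows += sum(1 for _ in csv.reader(f))
    print(f"{split}: {len(files)} subject files, {rows} questions")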
Highlighted Details
Maintenance & Community
The project is associated with the declare-lab research group, and the README links to related declare-lab projects such as AlgoPuzzleVQA and Resta.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Setup requires a specific Python version (3.8) and a manual download of the MMLU data. The README does not enumerate every supported HuggingFace model; it only provides examples.