instruct-eval by declare-lab

Evaluation code for instruction-tuned LLMs

created 2 years ago
546 stars

Top 59.3% on sourcepulse

Project Summary

InstructEval provides a framework for quantitatively evaluating instruction-tuned large language models (LLMs) on held-out tasks. It aims to simplify and standardize benchmarking across various models and datasets, enabling easier comparison of performance for models like Alpaca and Flan-T5 against larger, more costly LLMs.

How It Works

The project leverages HuggingFace Transformers to support a wide range of models, including causal and sequence-to-sequence architectures. It implements evaluation protocols for established benchmarks such as MMLU, BBH, DROP, and HumanEval, using specific prompting strategies (e.g., 5-shot for MMLU, 0-shot for HumanEval) and metrics (exact-match, pass@1). This approach facilitates direct comparison of instruction-tuned models against each other and against larger models like GPT-4.
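The metrics named above can be sketched in plain Python. This is a minimal illustration, not the repository's actual evaluation code: the few-shot examples are made up, and the repo's real MMLU prompt templates are not reproduced here. The pass@k estimator follows the standard unbiased formula, 1 - C(n-c, k)/C(n, k), for n sampled completions of which c are correct.

```python
from itertools import islice
from math import comb

# Hypothetical solved examples standing in for real MMLU demonstrations.
EXAMPLES = [
    ("2 + 2 =", "4"),
    ("3 * 3 =", "9"),
    ("10 - 7 =", "3"),
    ("8 / 2 =", "4"),
    ("5 + 6 =", "11"),
]

def build_few_shot_prompt(question: str, shots=EXAMPLES, k: int = 5) -> str:
    """Concatenate k solved examples before the target question (5-shot style)."""
    demo = "\n".join(f"Q: {q}\nA: {a}" for q, a in islice(shots, k))
    return f"{demo}\nQ: {question}\nA:"

def exact_match(prediction: str, reference: str) -> bool:
    """Exact-match metric after trivial whitespace/case normalization."""
    return prediction.strip().lower() == reference.strip().lower()

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct completion
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For HumanEval-style pass@1, `pass_at_k(n, c, 1)` reduces to the fraction of correct completions, c/n.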

Quick Start & Requirements

  • Install: conda create -n instruct-eval python=3.8 -y, conda activate instruct-eval, pip install -r requirements.txt.
  • Data: Requires downloading the MMLU dataset (wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar, tar -xf data/mmlu.tar -C data && mv data/data data/mmlu).
  • Dependencies: Python 3.8, HuggingFace Transformers, and specific libraries listed in requirements.txt.
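The data step above can equivalently be scripted. This sketch assumes the same archive URL and target layout (data/mmlu) as the shell commands; it mirrors the wget, tar, and mv steps with the standard library.

```python
import tarfile
import urllib.request
from pathlib import Path

MMLU_URL = "https://people.eecs.berkeley.edu/~hendrycks/data.tar"

def mmlu_target(data_dir: str = "data") -> Path:
    """Final location the benchmark expects: data/mmlu."""
    return Path(data_dir) / "mmlu"

def download_mmlu(data_dir: str = "data") -> Path:
    """Fetch and unpack the MMLU archive:
    wget ... -O data/mmlu.tar; tar -xf data/mmlu.tar -C data; mv data/data data/mmlu."""
    data = Path(data_dir)
    data.mkdir(parents=True, exist_ok=True)
    archive = data / "mmlu.tar"
    urllib.request.urlretrieve(MMLU_URL, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(data)  # archive unpacks to data/data
    (data / "data").rename(mmlu_target(data_dir))
    return mmlu_target(data_dir)
```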

Highlighted Details

  • Supports a broad range of HuggingFace models (GPT-2, GPT-J, OPT-IML, BLOOMZ, Flan-T5, LLaMA, Alpaca, Vicuna, ChatGLM).
  • Includes a leaderboard for tracking model performance across benchmarks.
  • Introduces the IMPACT dataset for evaluating writing capabilities (Informative, Professional, Argumentative, Creative).
  • Recently added Red-Eval for safety evaluation, reporting high jailbreaking success rates for GPT-4 and ChatGPT.

Maintenance & Community

The project is associated with the declare-lab research group. Further details on related projects like AlgoPuzzleVQA and Resta are available via provided links.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup requires a specific Python version (3.8) and a manual data download. The README names example supported HuggingFace models but does not enumerate the full list.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Ross Taylor (Cofounder of General Reasoning; Creator of Papers with Code), Daniel Han (Cofounder of Unsloth), and 4 more.

open-instruct by allenai

0.2% · 3k stars
Training codebase for instruction-following language models
created 2 years ago, updated 12 hours ago