Evaluation code for instruction-tuned LLMs
Top 59.3% on sourcepulse
InstructEval provides a framework for quantitatively evaluating instruction-tuned large language models (LLMs) on held-out tasks. It aims to simplify and standardize benchmarking across models and datasets, making it easier to compare instruction-tuned models such as Alpaca and Flan-T5 against larger, more costly LLMs.
How It Works
The project leverages HuggingFace Transformers to support a wide range of models, including causal and sequence-to-sequence architectures. It implements evaluation protocols for established benchmarks such as MMLU, BBH, DROP, and HumanEval, using specific prompting strategies (e.g., 5-shot for MMLU, 0-shot for HumanEval) and metrics (exact-match, pass@1). This approach facilitates direct comparison of instruction-tuned models against each other and against larger models like GPT-4.
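To make the protocol concrete, the following is a minimal sketch of a few-shot, exact-match style evaluation of a sequence-to-sequence model with HuggingFace Transformers. It is illustrative only: the model name, the toy questions, and the simplified prompt construction are assumptions, not the project's own code.

# Illustrative sketch: few-shot, exact-match evaluation of a seq-to-seq model
# (e.g. Flan-T5) with HuggingFace Transformers. The prompt building below is a
# simplified stand-in for the project's actual prompting code.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"  # small model chosen only for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Solved examples followed by the test question (toy data).
few_shot = (
    "Question: 2 + 2 = ?\nAnswer: 4\n\n"
    "Question: The capital of France is?\nAnswer: Paris\n\n"
    # ...three more solved examples would complete a 5-shot prompt...
)
test_question = "Question: 3 * 3 = ?\nAnswer:"
gold_answer = "9"

inputs = tokenizer(few_shot + test_question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Exact-match scoring: 1 if the prediction equals the reference, else 0.
print("exact_match:", int(prediction == gold_answer))

HumanEval's pass@1 follows the same overall shape, except the model generates code and scoring checks whether that code passes the task's unit tests rather than exact string equality.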
Quick Start & Requirements
conda create -n instruct-eval python=3.8 -y
conda activate instruct-eval
pip install -r requirements.txt

mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu

Python dependencies are listed in requirements.txt.
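After the download, a quick sanity check can confirm the benchmark data is in place. The sketch below assumes the Hendrycks MMLU tarball extracts into per-split directories of headerless CSV files under data/mmlu (dev/, val/, test/); adjust the paths if the layout differs.

# Count subject files and questions per MMLU split under data/mmlu.
from pathlib import Path
import csv

data_dir = Path("data/mmlu")
for split in ("dev", "val", "test"):
    files = sorted((data_dir / split).glob("*.csv"))
    rows = 0
    for path in files:
        with open(path, newline="", encoding="utf-8") as f:
            rows += sum(1 for _ in csv.reader(f))
    print(f"{split}: {len(files)} subject files, {rows} questions")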
Highlighted Details
Maintenance & Community
The project is associated with the declare-lab research group, and the README links to related declare-lab projects such as AlgoPuzzleVQA and Resta.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Setup requires a specific Python version (3.8) and a manual download of the MMLU data. The README does not enumerate every supported HuggingFace model; it only provides examples.