Lightweight library for evaluating language models
This library provides a lightweight framework for evaluating language models, focusing on zero-shot, chain-of-thought prompting for realistic performance assessment. It targets researchers and developers needing transparent, reproducible benchmarks for LLM accuracy, offering a curated set of standard evaluations.
How It Works
The library emphasizes simple, direct instructions for evaluations, avoiding complex few-shot or role-playing prompts that can skew results for instruction-tuned models. This approach aims to better reflect real-world usage and model capabilities in a zero-shot setting.
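As a rough illustration of this prompting philosophy, the sketch below contrasts a zero-shot, chain-of-thought prompt with a few-shot template. The strings are hypothetical and are not the library's actual templates.

```python
# Hypothetical illustration of zero-shot, chain-of-thought prompting versus a
# few-shot template. Not the library's actual prompt text.

QUESTION = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Zero-shot CoT: one direct instruction, then the question.
zero_shot_prompt = (
    "Solve the following problem. Think step by step, then give the final answer "
    "on the last line in the form 'Answer: <value>'.\n\n"
    f"{QUESTION}"
)

# Few-shot alternative (avoided here): worked examples are prepended, which can
# bias instruction-tuned models toward mimicking the examples' format and give a
# less realistic picture of zero-shot, real-world performance.
few_shot_prompt = (
    "Q: 2 + 2 = ?\nA: Let's think step by step. 2 + 2 = 4. Answer: 4\n\n"
    f"Q: {QUESTION}\nA:"
)
```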
Quick Start & Requirements
Install the client for each API you plan to evaluate: pip install openai (for the OpenAI API) and pip install anthropic (for the Anthropic API).

Run an evaluation with:

python -m simple_evals.simple_evals --model <model_name> --examples <num_examples>

To run HumanEval, also clone and install the human-eval harness:

git clone https://github.com/openai/human-eval
pip install -e human-eval
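For a sense of what hooking up a model involves, here is a minimal sketch of a chat sampler built on the official openai client. The class name and call signature are hypothetical and may not match the library's actual adapter interface.

```python
# Minimal sketch of a chat sampler in the spirit of the library's model adapters.
# The class name and interface are hypothetical; the real adapter API may differ.
from openai import OpenAI


class SimpleChatSampler:
    def __init__(self, model: str = "gpt-4o-mini", temperature: float = 0.0):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model
        self.temperature = temperature

    def __call__(self, messages: list[dict]) -> str:
        # Map a list of chat messages to a single completion string.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=self.temperature,
        )
        return response.choices[0].message.content


if __name__ == "__main__":
    sampler = SimpleChatSampler()
    print(sampler([{"role": "user", "content": "What is 17 * 24? Think step by step."}]))
```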
Highlighted Details
New evals should be contributed to the openai/evals repository rather than to this one.
Maintenance & Community
The repository states it will not be actively maintained; PRs are accepted only in limited cases, such as bug fixes, new model adapters, or eval results for additional models.
Licensing & Compatibility
Limitations & Caveats
This repository is not actively maintained and will not accept new evals. Some benchmarks (MGSM, DROP) may be saturated for newer models. Results for newer models on MATH use a newer, IID version (MATH-500).