Open-source Python framework for holistic evaluation of foundation models
HELM (Holistic Evaluation of Language Models) is a Python framework for comprehensive, reproducible, and transparent evaluation of foundation models, including LLMs and multimodal models. It provides standardized datasets, a unified interface for various models, and metrics beyond accuracy, targeting researchers and developers needing robust model assessment.
How It Works
HELM standardizes the evaluation process by offering a unified interface to access diverse models (e.g., OpenAI, Anthropic, Google) and a consistent format for numerous datasets and benchmarks (e.g., MMLU-Pro, GPQA, WildBench). It incorporates metrics for accuracy, efficiency, bias, and toxicity, enabling a holistic view of model performance.
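Each run entry packs the scenario, its arguments, and the model into a single string of the form `scenario:key=value,...,model=provider/name` (as in the Quick Start below). A minimal sketch of how such an entry decomposes; the parser here is illustrative only, not HELM's own:

```python
def parse_run_entry(entry: str) -> dict:
    # Split "scenario:key=value,...,model=provider/name" into its parts
    scenario, _, args = entry.partition(":")
    params = dict(kv.split("=", 1) for kv in args.split(",") if kv)
    return {"scenario": scenario, **params}

print(parse_run_entry("mmlu:subject=philosophy,model=openai/gpt2"))
# {'scenario': 'mmlu', 'subject': 'philosophy', 'model': 'openai/gpt2'}
```

Because scenario arguments and the model live in one entry, a single `helm-run` invocation can sweep many scenario/model combinations.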
Quick Start & Requirements
pip install crfm-helm
# Evaluate GPT-2 on the MMLU philosophy subject, capped at 10 instances
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
# Aggregate the run's results
helm-summarize --suite my-suite
# Serve the results web UI (access at http://localhost:8000/)
helm-server --suite my-suite
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats