helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

Created 3 years ago
2,476 stars

Top 18.8% on SourcePulse

View on GitHub
Project Summary

HELM (Holistic Evaluation of Language Models) is a Python framework for comprehensive, reproducible, and transparent evaluation of foundation models, including LLMs and multimodal models. It provides standardized datasets, a unified interface for various models, and metrics beyond accuracy, targeting researchers and developers needing robust model assessment.

How It Works

HELM standardizes the evaluation process by offering a unified interface to models from many providers (e.g., OpenAI, Anthropic, Google) and a consistent format for numerous datasets and benchmarks (e.g., MMLU-Pro, GPQA, WildBench). It incorporates metrics for accuracy, efficiency, bias, and toxicity, enabling a holistic view of model performance.
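Each evaluation is specified as a run entry, a single string that combines the scenario, its arguments, and the model (e.g. mmlu:subject=philosophy,model=openai/gpt2). Larger evaluations are usually listed in a conf file and passed to helm-run. A minimal sketch of that form, following the run-entries format in the HELM documentation (the priority field and the --conf-paths flag may differ across versions):

    # run_entries.conf -- one entry per scenario/model combination
    entries: [
      {description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1},
      {description: "mmlu:subject=anatomy,model=openai/gpt2", priority: 1},
    ]

    # Evaluate every entry in the file under a single suite
    helm-run --conf-paths run_entries.conf --suite my-suite --max-eval-instances 10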

Quick Start & Requirements

  • Install: pip install crfm-helm
  • Run benchmark: helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
  • Summarize results: helm-summarize --suite my-suite
  • Start server: helm-server --suite my-suite (access at http://localhost:8000/)
  • Documentation: https://crfm-helm.readthedocs.io/ (a consolidated run-through of the commands above follows this list)
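Putting the steps above together, a minimal end-to-end sketch; the benchmark_output path reflects HELM's default output layout and is an assumption if your configuration differs:

    pip install crfm-helm
    # Run a small evaluation; raw prompts, responses, and per-instance stats
    # are written under benchmark_output/runs/my-suite
    helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
    # Aggregate the run into tables the web UI can display
    helm-summarize --suite my-suite
    # Browse prompts, responses, and metrics at http://localhost:8000/
    helm-server --suite my-suite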

Highlighted Details

  • Supports evaluation of LLMs and multimodal models.
  • Includes a web UI for prompt/response inspection and a web leaderboard for result comparison.
  • Used as the evaluation framework in multiple research papers, with reproducible results.
  • Offers leaderboards for capabilities, safety, vision-language models (VHELM), and domain-specific evaluations.

Maintenance & Community

  • Developed by the Center for Research on Foundation Models (CRFM) at Stanford.
  • Official leaderboards are maintained for recent model evaluations.
  • Links to papers and documentation for reproducing results are provided.

Licensing & Compatibility

  • The README does not explicitly state a license. The project is open source and developed by Stanford, but check the repository's LICENSE file and confirm the terms before commercial use.

Limitations & Caveats

  • The README does not specify any limitations or known caveats regarding the framework's functionality or stability.

Health Check

  • Last Commit: 14 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 62
  • Issues (30d): 5

Star History

57 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

evaluate by huggingface

0.1% · 2k stars
ML model evaluation library for standardized performance reporting
Created 3 years ago
Updated 1 month ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

2.6% · 2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Travis Fischer (Founder of Agentic), and 2 more.

modelscope by modelscope

0.2% · 8k stars
Model-as-a-Service library for model inference, training, and evaluation
Created 3 years ago
Updated 1 day ago