helm  by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

created 3 years ago
2,373 stars

Top 19.8% on sourcepulse

Project Summary

HELM (Holistic Evaluation of Language Models) is a Python framework for comprehensive, reproducible, and transparent evaluation of foundation models, including LLMs and multimodal models. It provides standardized datasets, a unified interface for various models, and metrics beyond accuracy, targeting researchers and developers needing robust model assessment.

How It Works

HELM standardizes the evaluation process by offering a unified interface to access diverse models (e.g., OpenAI, Anthropic, Google) and a consistent format for numerous datasets and benchmarks (e.g., MMLU-Pro, GPQA, WildBench). It incorporates metrics for accuracy, efficiency, bias, and toxicity, enabling a holistic view of model performance.

Quick Start & Requirements

  • Install: pip install crfm-helm
  • Run benchmark: helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
  • Summarize results: helm-summarize --suite my-suite
  • Start server: helm-server --suite my-suite (access at http://localhost:8000/)
  • Documentation: Read the Docs
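
The three commands above share one suite name, and the run entry follows a `scenario:key=value,key=value` pattern. The Python sketch below assembles the same command lines programmatically, which can help when scripting several run entries. It is a minimal illustration: the helper names `build_run_entry` and `quick_start_commands` are hypothetical and not part of HELM, and the run-entry format is inferred only from the single example shown in the Quick Start. The sketch prints the commands rather than executing them.

```python
import shlex


def build_run_entry(scenario: str, **args: str) -> str:
    """Format a run entry as `scenario:key=value,key=value`.

    Assumption: this pattern is inferred from the Quick Start example,
    not taken from a formal HELM specification.
    """
    if not args:
        return scenario
    pairs = ",".join(f"{key}={value}" for key, value in args.items())
    return f"{scenario}:{pairs}"


def quick_start_commands(run_entry: str, suite: str, max_eval_instances: int) -> list[list[str]]:
    """Return the run, summarize, and serve commands as argv lists."""
    return [
        [
            "helm-run",
            "--run-entries", run_entry,
            "--suite", suite,
            "--max-eval-instances", str(max_eval_instances),
        ],
        ["helm-summarize", "--suite", suite],
        ["helm-server", "--suite", suite],
    ]


entry = build_run_entry("mmlu", subject="philosophy", model="openai/gpt2")
commands = quick_start_commands(entry, suite="my-suite", max_eval_instances=10)

# Print each command as a shell-safe string instead of running it.
for argv in commands:
    print(shlex.join(argv))
```

To execute the commands instead of printing them, each argv list could be passed to `subprocess.run(argv, check=True)` in an environment where `crfm-helm` is installed.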

Highlighted Details

  • Supports evaluation of LLMs and multimodal models.
  • Includes a web UI for prompt/response inspection and a web leaderboard for result comparison.
  • Framework used in multiple research papers for model evaluation, with reproducible results.
  • Offers leaderboards for capabilities, safety, vision-language models (VHELM), and domain-specific evaluations.

Maintenance & Community

  • Developed by the Center for Research on Foundation Models (CRFM) at Stanford.
  • Official leaderboards are maintained for recent model evaluations.
  • Links to papers and documentation for reproducing results are provided.

Licensing & Compatibility

  • The README does not explicitly state a license, although the project is described as open-source. Confirm the license terms before commercial use.

Limitations & Caveats

  • The README does not specify any limitations or known caveats regarding the framework's functionality or stability.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 week
  • Pull requests (30d): 68
  • Issues (30d): 14
  • Star history: 181 stars in the last 90 days
