helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

Created 3 years ago
2,476 stars

Top 18.8% on SourcePulse

View on GitHub
Project Summary

HELM (Holistic Evaluation of Language Models) is a Python framework for comprehensive, reproducible, and transparent evaluation of foundation models, including LLMs and multimodal models. It provides standardized datasets, a unified interface for various models, and metrics beyond accuracy, targeting researchers and developers needing robust model assessment.

How It Works

HELM standardizes the evaluation process by offering a unified interface to models from many providers (e.g., OpenAI, Anthropic, Google) and a consistent format for numerous datasets and benchmarks (e.g., MMLU-Pro, GPQA, WildBench). It incorporates metrics for accuracy, efficiency, bias, and toxicity, enabling a holistic view of model performance.
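Each evaluation is specified as a run entry, a single string that combines the scenario, its arguments, and the model (e.g. mmlu:subject=philosophy,model=openai/gpt2). Larger evaluations are usually listed in a conf file and passed to helm-run. A minimal sketch of that form, following the run-entries format in the HELM documentation (the priority field and the --conf-paths flag may differ across versions):

    # run_entries.conf -- one entry per scenario/model combination
    entries: [
      {description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1},
      {description: "mmlu:subject=anatomy,model=openai/gpt2", priority: 1},
    ]

    # Evaluate every entry in the file under a single suite
    helm-run --conf-paths run_entries.conf --suite my-suite --max-eval-instances 10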

Quick Start & Requirements

  • Install: pip install crfm-helm
  • Run benchmark: helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
  • Summarize results: helm-summarize --suite my-suite
  • Start server: helm-server --suite my-suite (access at http://localhost:8000/)
  • Documentation: https://crfm-helm.readthedocs.io/ (a consolidated run-through of the commands above follows this list)
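Putting the steps above together, a minimal end-to-end sketch; the benchmark_output path reflects HELM's default output layout and is an assumption if your configuration differs:

    pip install crfm-helm
    # Run a small evaluation; raw prompts, responses, and per-instance stats
    # are written under benchmark_output/runs/my-suite
    helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
    # Aggregate the run into tables the web UI can display
    helm-summarize --suite my-suite
    # Browse prompts, responses, and metrics at http://localhost:8000/
    helm-server --suite my-suite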

Highlighted Details

  • Supports evaluation of LLMs and multimodal models.
  • Includes a web UI for prompt/response inspection and a web leaderboard for result comparison.
  • Used as the evaluation framework in multiple research papers, with reproducible results.
  • Offers leaderboards for capabilities, safety, vision-language models (VHELM), and domain-specific evaluations.

Maintenance & Community

  • Developed by the Center for Research on Foundation Models (CRFM) at Stanford.
  • Official leaderboards are maintained for recent model evaluations.
  • Links to papers and documentation for reproducing results are provided.

Licensing & Compatibility

  • The README does not explicitly state a license. The project is open source and developed by Stanford, but check the repository's LICENSE file and confirm the terms before commercial use.

Limitations & Caveats

  • The README does not specify any limitations or known caveats regarding the framework's functionality or stability.

Health Check

  • Last Commit: 14 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 62
  • Issues (30d): 5

Star History

57 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

evaluate by huggingface

0.1% · 2k stars
ML model evaluation library for standardized performance reporting
Created 3 years ago
Updated 1 month ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

2.6% · 2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Travis Fischer (Founder of Agentic), and 2 more.

modelscope by modelscope

0.2% · 8k stars
Model-as-a-Service library for model inference, training, and evaluation
Created 3 years ago
Updated 1 day ago