Evaluator by NVIDIA-NeMo

Open-source library for scalable, reproducible AI model and benchmark evaluation

Created 1 year ago

310 stars

Top 86.6% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

Summary

NVIDIA-NeMo/Evaluator is an open-source SDK for scalable, reproducible AI model and benchmark evaluation. It targets researchers and engineers needing to rigorously assess LLMs against numerous benchmarks, offering a unified CLI, pluggable architecture, and containerized execution for auditable results. The platform simplifies integrating public benchmarks and private datasets for efficient model comparison.

How It Works

The system has two components: the nemo-evaluator core engine and the nemo-evaluator-launcher CLI. Evaluations run in open-source Docker containers, and each run captures its configuration, seeds, and provenance so that results can be reproduced and audited. Because the architecture is containerized and pluggable, the same workflow scales from a local machine to Slurm or cloud backends (e.g., Lepton AI) without changes.
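
As a rough illustration of what capturing configurations, seeds, and provenance can look like, the sketch below builds a small run manifest with a deterministic hash of the configuration. It is not NeMo Evaluator's own code: the helper name build_manifest, the field names, and the placeholder values are all assumptions made for this example.

```python
import hashlib
import json


def build_manifest(config: dict, seed: int, container_image: str) -> dict:
    """Return a small record describing one evaluation run.

    Hypothetical helper for illustration; not part of nemo-evaluator.
    """
    # Serialize with sorted keys so that identical configs always
    # produce the same hash, regardless of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "config": config,
        "config_sha256": digest,
        "seed": seed,
        "container_image": container_image,
    }


if __name__ == "__main__":
    manifest = build_manifest(
        {"task": "example_task", "limit": 10},  # placeholder config values
        seed=1234,
        container_image="example.registry/eval-image:tag",  # placeholder
    )
    print(json.dumps(manifest, indent=2))
```

Storing a record like this next to the results makes it possible to check later whether two runs used the same configuration, seed, and container image.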

Quick Start & Requirements

Installation: pip install nemo-evaluator-launcher.
Model Endpoint: Requires an OpenAI-compatible API endpoint (hosted, self-hosted via NIM/vLLM/TRT-LLM, or NeMo-trained models). Hosted services may need an NGC API key (export NGC_API_KEY=<YOUR_API_KEY>).
Running: Use nemo-evaluator-launcher run --config <path_to_config.yaml> -o execution.output_dir=<YOUR_OUTPUT_LOCAL_DIR>. Example configs are in the repo.
Documentation: NeMo Evaluator Documentation
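
The steps above can be scripted. The following is a minimal sketch that assembles the run command exactly as given in the Running step and checks whether NGC_API_KEY is set. The helper names build_run_command and missing_env, and the example paths, are assumptions for illustration; only the command, the --config flag, and the -o execution.output_dir override come from the text above.

```python
from __future__ import annotations

import os
import shlex


def build_run_command(config_path: str, output_dir: str) -> list[str]:
    """Assemble the launcher invocation shown in the Quick Start.

    Hypothetical helper; it only builds the argument list and does
    not run anything.
    """
    return [
        "nemo-evaluator-launcher",
        "run",
        "--config",
        config_path,
        "-o",
        f"execution.output_dir={output_dir}",
    ]


def missing_env(names, environ=None) -> list[str]:
    """Return the names from `names` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Placeholder paths; substitute your own config and output directory.
    cmd = build_run_command("path/to/config.yaml", "./results")
    print(shlex.join(cmd))
    if missing_env(["NGC_API_KEY"]):
        print("NGC_API_KEY is not set; hosted endpoints may need it.")
```

Once the launcher is installed, the returned list can be passed to subprocess.run(cmd, check=True) to start the evaluation.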

Highlighted Details

Supports over 100 benchmarks across 18 evaluation harnesses (e.g., lm-evaluation-harness, HELM, MTEB).
Ensures reproducibility by default, capturing parameters for auditable evaluations.
Scalable execution across local, Slurm, and cloud backends.
Features Agentic Skills for interactive configuration, launching, and analysis.

Maintenance & Community

Contributions are welcome; see the Contribution Guide. Discussion takes place on GitHub Discussions. The project collects anonymous telemetry to guide improvements, and users can opt out.

Licensing & Compatibility

Licensed under the Apache License 2.0, which permits commercial use and integration into closed-source projects.

Limitations & Caveats

The nel ls command (nel is the short alias for nemo-evaluator-launcher) may require manual Docker authentication, and it does not support credentials stored in the macOS Keychain or GNOME Keyring. A preview of v0.3.0 is available on the dev/0.3.0 branch, which indicates that the project is under active development.
