Evaluator by NVIDIA-NeMo

Open-source library for scalable, reproducible AI model and benchmark evaluation

Created 9 months ago
253 stars

Top 99.3% on SourcePulse

Summary

NVIDIA-NeMo/Evaluator is an open-source SDK for scalable, reproducible evaluation of AI models against benchmarks. It targets researchers and engineers who need to assess LLMs rigorously across many benchmarks, and it offers a unified CLI, a pluggable architecture, and containerized execution that produces auditable results. Public benchmarks and private datasets can both be integrated, which makes model comparison more efficient.

How It Works

The system has two components: the nemo-evaluator core engine and the nemo-evaluator-launcher CLI. Evaluations run in open-source Docker containers, and each run captures its configuration, seeds, and provenance so that results can be reproduced and audited. Because the architecture is containerized and pluggable, the same workflow scales from a local machine to Slurm or cloud backends (e.g., Lepton AI) without changes.
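To make the idea of capturing configuration, seeds, and provenance concrete, here is a minimal Python sketch of a run record with a stable fingerprint. It is an illustration only and not the library's actual data model: the RunRecord class, its field names, and the example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of what is needed to repeat an evaluation run."""

    config: dict          # evaluation settings (assumed shape, for illustration)
    seed: int             # random seed used by the run
    container_image: str  # provenance: the image the evaluation ran in

    def fingerprint(self) -> str:
        # Serialize with sorted keys so that equal records always hash the same,
        # regardless of the order in which config keys were written.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same settings written in a different key order: same fingerprint.
a = RunRecord({"benchmark": "example-task", "limit": 10}, 1234, "example/image:1.0")
b = RunRecord({"limit": 10, "benchmark": "example-task"}, 1234, "example/image:1.0")

# A different seed is a different run: different fingerprint.
c = RunRecord({"benchmark": "example-task", "limit": 10}, 4321, "example/image:1.0")

print(a.fingerprint() == b.fingerprint())
print(a.fingerprint() == c.fingerprint())
```

Storing a fingerprint like this next to the results is one way to check later that two reported scores came from identical settings.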

Quick Start & Requirements

  • Installation: pip install nemo-evaluator-launcher.
  • Model Endpoint: Requires an OpenAI-compatible API endpoint (hosted, self-hosted via NIM/vLLM/TRT-LLM, or NeMo-trained models). Hosted services may need an NGC API key (export NGC_API_KEY=<YOUR_API_KEY>).
  • Running: Use nemo-evaluator-launcher run --config <path_to_config.yaml> -o execution.output_dir=<YOUR_OUTPUT_LOCAL_DIR>. Example configs are in the repo.
  • Documentation: NeMo Evaluator Documentation
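The run command above can also be assembled from a script. The following Python sketch uses only the subcommand and options quoted in this section; the helper names build_run_command and has_ngc_key, the config path, and the output directory are placeholders invented for illustration.

```python
import os
import shlex
from typing import List, Mapping, Optional


def build_run_command(config_path: str, output_dir: str) -> List[str]:
    """Assemble the launcher invocation quoted above as an argv list."""
    return [
        "nemo-evaluator-launcher",
        "run",
        "--config",
        config_path,
        "-o",
        f"execution.output_dir={output_dir}",
    ]


def has_ngc_key(env: Optional[Mapping[str, str]] = None) -> bool:
    """Report whether NGC_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("NGC_API_KEY"))


# Placeholder paths; substitute a real config from the repo and your own output directory.
cmd = build_run_command("configs/example.yaml", "./results")
print(shlex.join(cmd))

if not has_ngc_key():
    print("NGC_API_KEY is not set; hosted endpoints may reject requests.")
```

Passing the list to subprocess.run(cmd, check=True) would start the evaluation, assuming the launcher is installed and a model endpoint is reachable. Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.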

Highlighted Details

  • Supports over 100 benchmarks across 18 evaluation harnesses (e.g., lm-evaluation-harness, HELM, MTEB).
  • Ensures reproducibility by default, capturing parameters for auditable evaluations.
  • Scalable execution across local, Slurm, and cloud backends.
  • Features Agentic Skills for interactive configuration, launching, and analysis.

Maintenance & Community

Contributions are welcome; see the Contribution Guide. Discussion takes place on GitHub Discussions. The project collects anonymous telemetry to guide improvements, and users can opt out.

Licensing & Compatibility

Licensed under the Apache License 2.0, a permissive license that allows commercial use and integration into closed-source projects.

Limitations & Caveats

The nel ls command may require manual Docker authentication and lacks support for macOS Keychain or GNOME Keyring credential management. A preview of v0.3.0 on the dev/0.3.0 branch indicates ongoing development.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 58
  • Issues (30d): 4
  • Star History: 24 stars in the last 30 days
