This project provides a simplified, automated workflow for evaluating Large Language Models (LLMs) within a Google Colab environment, targeting researchers and developers who need to benchmark model performance across various datasets. It streamlines the setup and execution of evaluations, generating shareable summary reports.
How It Works
LLM AutoEval leverages cloud GPU providers like RunPod for compute, abstracting away complex infrastructure setup. Users specify the LLM to evaluate (via Hugging Face model ID), select a benchmark suite (Nous, Lighteval, or OpenLLM), and configure GPU resources. The system then automates the download, execution, and result aggregation, producing a summary that can be uploaded to GitHub Gists for easy sharing and comparison.
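For orientation, here is a minimal sketch of the kind of pod launch the notebook automates, using the runpod Python SDK. The Docker image name and the environment-variable names (`MODEL_ID`, `BENCHMARK`, `GITHUB_API_TOKEN`) are illustrative assumptions, not the project's exact interface.

```python
# Minimal sketch of the pod launch the notebook automates, using the runpod SDK
# (pip install runpod). The image name and env-var names are illustrative
# assumptions, not the project's exact interface.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_TOKEN"  # needs read & write permissions

pod = runpod.create_pod(
    name="llm-autoeval",
    image_name="your/eval-image:latest",           # hypothetical evaluation image
    gpu_type_id="NVIDIA GeForce RTX 3090",
    gpu_count=1,
    container_disk_in_gb=100,                      # room for model weights
    env={
        "MODEL_ID": "mistralai/Mistral-7B-v0.1",   # Hugging Face model to evaluate
        "BENCHMARK": "nous",                       # nous | lighteval | openllm
        "GITHUB_API_TOKEN": "ghp_...",             # for the summary Gist upload
    },
)
print(pod["id"])  # pod id, useful for monitoring or terminating later
```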
Quick Start & Requirements
- Installation: Runs primarily within a Google Colab notebook.
- Prerequisites:
  - RunPod account and API token (read & write permissions).
  - GitHub account and Personal Access Token (gist scope).
  - Optional: Hugging Face token.
- Recommended: high-end GPUs (RTX 3090 or better) for the Open LLM benchmark suite.
- Setup: Configure your RunPod and GitHub tokens in Colab's Secrets tab (see the sketch after this list).
- Documentation: LLM AutoEval README
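A minimal sketch of reading those secrets at runtime via Colab's `google.colab.userdata` API. The secret names (`runpod`, `github`, `hf`) are assumptions; use whatever names you created in the Secrets tab.

```python
# Minimal sketch: read tokens from Colab's Secrets tab at runtime.
# The secret names ("runpod", "github", "hf") are assumptions; use whatever
# names you created in the Secrets tab and granted notebook access to.
from google.colab import userdata

RUNPOD_TOKEN = userdata.get("runpod")   # RunPod API token (read & write)
GITHUB_TOKEN = userdata.get("github")   # GitHub PAT with the gist scope

try:
    HF_TOKEN = userdata.get("hf")       # optional Hugging Face token
except userdata.SecretNotFoundError:
    HF_TOKEN = None                     # fine to skip for public models
```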
Highlighted Details
- Supports multiple benchmark suites: Nous (AGIEval, GPT4ALL, TruthfulQA, Bigbench), Lighteval (HELM, PIQA, GSM8K, MATH), and OpenLLM (ARC, HellaSwag, MMLU, Winogrande, GSM8K, TruthfulQA).
- Integrates with vLLM for accelerated inference in the OpenLLM benchmark suite.
- Automated summary generation and upload to GitHub Gists for easy result sharing and leaderboard creation (e.g., YALL Leaderboard); see the Gist sketch after this list.
- Customizable evaluation parameters and GPU configurations (type, number, disk size).
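As a rough illustration of the Gist upload step, here is a sketch that posts a summary through the GitHub REST API directly. The function name, file name, and summary content are placeholders, not the project's code.

```python
# Sketch: upload an evaluation summary as a GitHub Gist via the REST API.
# Requires a PAT with the gist scope; function/file names are placeholders.
import requests

def upload_summary(token: str, summary_md: str) -> str:
    resp = requests.post(
        "https://api.github.com/gists",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "description": "LLM AutoEval summary",
            "public": True,
            "files": {"summary.md": {"content": summary_md}},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]  # shareable Gist URL
```

The returned URL is what gets shared or aggregated into a leaderboard.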
Maintenance & Community
- Project is in early stages, primarily for personal use, with an invitation for contributions.
- Acknowledgements mention integrations with lighteval (Hugging Face), lm-evaluation-harness (EleutherAI), and vllm.
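For context, the EleutherAI harness behind the OpenLLM-style tasks can also be driven standalone. A minimal sketch assuming the `lm_eval` CLI (v0.4+), with illustrative task names and model; LLM AutoEval wraps this kind of call for you:

```python
# Sketch: a typical standalone lm-evaluation-harness (v0.4+) run, the kind of
# call LLM AutoEval wraps. Task names and model are illustrative.
import subprocess

subprocess.run(
    [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=mistralai/Mistral-7B-v0.1",
        "--tasks", "arc_challenge,hellaswag,truthfulqa_mc2",
        "--batch_size", "auto",
        "--output_path", "results",
    ],
    check=True,  # raise if the harness exits with an error
)
```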
Licensing & Compatibility
- The repository's license is not explicitly stated in the README.
Limitations & Caveats
- The project is in its early stages and primarily intended for personal use.
- Specific benchmark tasks may be unavailable (e.g., the "mmlu" task is missing from the OpenLLM suite due to a vLLM issue).
- Insufficient memory can trigger "700 Killed" errors, particularly with demanding suites like Open LLM.
- Requires specific CUDA driver versions; outdated drivers necessitate starting a new pod.