This project provides a simplified, automated workflow for evaluating Large Language Models (LLMs) within a Google Colab environment, targeting researchers and developers who need to benchmark model performance across various datasets. It streamlines the setup and execution of evaluations, generating shareable summary reports.
How It Works
LLM AutoEval leverages cloud GPU providers like RunPod for compute, abstracting away complex infrastructure setup. Users specify the LLM to evaluate (via Hugging Face model ID), select a benchmark suite (Nous, Lighteval, or OpenLLM), and configure GPU resources. The system then automates the download, execution, and result aggregation, producing a summary that can be uploaded to GitHub Gists for easy sharing and comparison.
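For orientation, here is a minimal sketch of the kind of pod launch the notebook automates, using the runpod Python SDK. The Docker image name and the environment-variable names (`MODEL_ID`, `BENCHMARK`, `GITHUB_API_TOKEN`) are illustrative assumptions, not the project's exact interface.

```python
# Minimal sketch of the pod launch the notebook automates, using the runpod SDK
# (pip install runpod). The image name and env-var names are illustrative
# assumptions, not the project's exact interface.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_TOKEN"  # needs read & write permissions

pod = runpod.create_pod(
    name="llm-autoeval",
    image_name="your/eval-image:latest",           # hypothetical evaluation image
    gpu_type_id="NVIDIA GeForce RTX 3090",
    gpu_count=1,
    container_disk_in_gb=100,                      # room for model weights
    env={
        "MODEL_ID": "mistralai/Mistral-7B-v0.1",   # Hugging Face model to evaluate
        "BENCHMARK": "nous",                       # nous | lighteval | openllm
        "GITHUB_API_TOKEN": "ghp_...",             # for the summary Gist upload
    },
)
print(pod["id"])  # pod id, useful for monitoring or terminating later
```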
Quick Start & Requirements
- Installation: Runs primarily within a Google Colab notebook.
- Prerequisites:
  - RunPod account and API token (read & write permissions).
  - GitHub account and Personal Access Token (gist scope).
  - Optional: Hugging Face token.
- Recommended: high-end GPUs (RTX 3090 or better) for the Open LLM benchmark suite.
- Setup: Configure your RunPod and GitHub tokens in Colab's Secrets tab (see the sketch after this list).
- Documentation: LLM AutoEval README
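A minimal sketch of reading those secrets at runtime via Colab's `google.colab.userdata` API. The secret names (`runpod`, `github`, `hf`) are assumptions; use whatever names you created in the Secrets tab.

```python
# Minimal sketch: read tokens from Colab's Secrets tab at runtime.
# The secret names ("runpod", "github", "hf") are assumptions; use whatever
# names you created in the Secrets tab and granted notebook access to.
from google.colab import userdata

RUNPOD_TOKEN = userdata.get("runpod")   # RunPod API token (read & write)
GITHUB_TOKEN = userdata.get("github")   # GitHub PAT with the gist scope

try:
    HF_TOKEN = userdata.get("hf")       # optional Hugging Face token
except userdata.SecretNotFoundError:
    HF_TOKEN = None                     # fine to skip for public models
```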
Highlighted Details
- Supports multiple benchmark suites: Nous (AGIEval, GPT4ALL, TruthfulQA, Bigbench), Lighteval (HELM, PIQA, GSM8K, MATH), and OpenLLM (ARC, HellaSwag, MMLU, Winogrande, GSM8K, TruthfulQA).
- Integrates with vLLM for accelerated inference in the OpenLLM benchmark suite.
- Automated summary generation and upload to GitHub Gists for easy result sharing and leaderboard creation (e.g., YALL Leaderboard); see the Gist sketch after this list.
- Customizable evaluation parameters and GPU configurations (type, number, disk size).
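As a rough illustration of the Gist upload step, here is a sketch that posts a summary through the GitHub REST API directly. The function name, file name, and summary content are placeholders, not the project's code.

```python
# Sketch: upload an evaluation summary as a GitHub Gist via the REST API.
# Requires a PAT with the gist scope; function/file names are placeholders.
import requests

def upload_summary(token: str, summary_md: str) -> str:
    resp = requests.post(
        "https://api.github.com/gists",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "description": "LLM AutoEval summary",
            "public": True,
            "files": {"summary.md": {"content": summary_md}},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]  # shareable Gist URL
```

The returned URL is what gets shared or aggregated into a leaderboard.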
Maintenance & Community
- Project is in early stages, primarily for personal use, with an invitation for contributions.
- Acknowledgements mention integrations with lighteval (Hugging Face), lm-evaluation-harness (EleutherAI), and vllm.
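For context, the EleutherAI harness behind the OpenLLM-style tasks can also be driven standalone. A minimal sketch assuming the `lm_eval` CLI (v0.4+), with illustrative task names and model; LLM AutoEval wraps this kind of call for you:

```python
# Sketch: a typical standalone lm-evaluation-harness (v0.4+) run, the kind of
# call LLM AutoEval wraps. Task names and model are illustrative.
import subprocess

subprocess.run(
    [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=mistralai/Mistral-7B-v0.1",
        "--tasks", "arc_challenge,hellaswag,truthfulqa_mc2",
        "--batch_size", "auto",
        "--output_path", "results",
    ],
    check=True,  # raise if the harness exits with an error
)
```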
Licensing & Compatibility
- The repository's license is not explicitly stated in the README.
Limitations & Caveats
- The project is in its early stages and primarily intended for personal use.
- Specific benchmark tasks may be unavailable (e.g., the "mmlu" task is missing from the OpenLLM suite due to a vLLM issue).
- Insufficient memory can trigger "700 Killed" errors, particularly with demanding suites like Open LLM.
- Requires specific CUDA driver versions; outdated drivers necessitate starting a new pod.