llm-autoeval by mlabonne

Colab notebook for LLM evaluation

Created 1 year ago
658 stars

Top 50.8% on SourcePulse

Project Summary

This project provides a simplified, automated workflow for evaluating Large Language Models (LLMs) within a Google Colab environment, targeting researchers and developers who need to benchmark model performance across various datasets. It streamlines the setup and execution of evaluations, generating shareable summary reports.

How It Works

LLM AutoEval leverages cloud GPU providers such as RunPod for compute, abstracting away complex infrastructure setup. Users specify the LLM to evaluate via its Hugging Face model ID, select a benchmark suite (Nous, Lighteval, or OpenLLM), and configure GPU resources. The system then automates model download, benchmark execution, and result aggregation, producing a summary that can be uploaded to GitHub Gists for easy sharing and comparison.
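
As a rough illustration of the pod-launch step, the sketch below uses the runpod Python SDK; the container image, environment-variable names, and model ID are hypothetical placeholders, not the notebook's actual parameters.

    # Minimal sketch of launching an evaluation pod, assuming the runpod Python SDK.
    # Image name, env-var names, and model ID are illustrative, not the project's own.
    import runpod

    runpod.api_key = "YOUR_RUNPOD_API_TOKEN"  # read & write token from the RunPod console

    pod = runpod.create_pod(
        name="llm-autoeval",
        image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",  # illustrative base image
        gpu_type_id="NVIDIA GeForce RTX 3090",  # beefier GPUs recommended for the Open LLM suite
        gpu_count=1,
        container_disk_in_gb=100,
        env={
            "MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model ID to evaluate
            "BENCHMARK": "nous",                               # nous | lighteval | openllm
        },
    )
    print(pod["id"])  # pod handle, used to monitor the run until results are aggregated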

Quick Start & Requirements

  • Installation: Primarily run within a Google Colab notebook.
  • Prerequisites:
    • RunPod account and API token (read & write permissions).
    • GitHub account and Personal Access Token (gist scope).
    • Optional: Hugging Face token.
    • Recommended: A beefy GPU (RTX 3090 or higher) for the Open LLM benchmark suite.
  • Setup: Requires configuring secrets in Colab's Secrets tab for the RunPod and GitHub tokens (see the sketch after this list).
  • Documentation: LLM AutoEval README
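
A minimal sketch of the secrets step, using Colab's google.colab.userdata API; the secret names shown here are placeholders rather than the names the notebook necessarily expects.

    # Minimal sketch: reading tokens from Colab's Secrets tab via google.colab.userdata.
    # The secret names below are placeholders; match them to what the notebook expects.
    from google.colab import userdata

    RUNPOD_TOKEN = userdata.get("runpod")  # RunPod API key with read & write permissions
    GITHUB_TOKEN = userdata.get("github")  # GitHub personal access token with the "gist" scope
    HF_TOKEN = userdata.get("hf")          # optional, for gated Hugging Face models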

Highlighted Details

  • Supports multiple benchmark suites: Nous (AGIEval, GPT4All, TruthfulQA, Bigbench), Lighteval (HELM, PIQA, GSM8K, MATH), and OpenLLM (ARC, HellaSwag, MMLU, Winogrande, GSM8K, TruthfulQA).
  • Integrates with vLLM for accelerated inference in the OpenLLM benchmark suite.
  • Automated summary generation and upload to GitHub Gists for easy result sharing and leaderboard creation (e.g., the YALL Leaderboard); a Gist-upload sketch follows this list.
  • Customizable evaluation parameters and GPU configurations (type, number, disk size).
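
The sketch below posts a Markdown summary to the GitHub Gists REST API with requests; the file name and summary content are placeholders, and the project itself may use a different client for this step.

    # Minimal sketch: publishing a summary to GitHub Gists via the REST API.
    # File name and summary content are placeholders.
    import requests

    GITHUB_TOKEN = "ghp_..."  # personal access token with the "gist" scope
    summary_md = "# Example benchmark summary\n(results table goes here)\n"

    resp = requests.post(
        "https://api.github.com/gists",
        headers={
            "Authorization": f"token {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "description": "LLM AutoEval summary",
            "public": True,
            "files": {"benchmark-summary.md": {"content": summary_md}},
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["html_url"])  # shareable link for comparison or leaderboards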

Maintenance & Community

  • Project is in early stages, primarily for personal use, with an invitation for contributions.
  • Acknowledgements credit integrations with lighteval (Hugging Face), lm-evaluation-harness (EleutherAI), and vLLM.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README.

Limitations & Caveats

  • The project is in its early stages and primarily intended for personal use.
  • Specific benchmark tasks may have issues (e.g., the "mmlu" task missing from the OpenLLM suite due to vLLM).
  • Hardware limitations can lead to "700 Killed" errors, particularly with demanding suites such as Open LLM.
  • Specific CUDA driver versions are required; outdated drivers mean starting a new pod.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (Coauthor of SGLang), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

KernelBench by ScalingIntelligence
  • 1.9% · 569 stars
  • Benchmark for LLMs generating GPU kernels from PyTorch ops
  • Created 10 months ago · Updated 3 weeks ago

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface
  • 2.6% · 2k stars
  • LLM evaluation toolkit for multiple backends
  • Created 1 year ago · Updated 1 day ago

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench
  • 2.3% · 4k stars
  • Benchmark for evaluating LLMs on real-world GitHub issues
  • Created 1 year ago · Updated 18 hours ago