llm-autoeval by mlabonne

Colab notebook for LLM evaluation

created 1 year ago
649 stars

Top 52.4% on sourcepulse

View on GitHub
Project Summary

This project provides a simplified, automated workflow for evaluating Large Language Models (LLMs) within a Google Colab environment, targeting researchers and developers who need to benchmark model performance across various datasets. It streamlines the setup and execution of evaluations, generating shareable summary reports.

How It Works

LLM AutoEval leverages cloud GPU providers like RunPod for compute, abstracting away complex infrastructure setup. Users specify the LLM to evaluate (via Hugging Face model ID), select a benchmark suite (Nous, Lighteval, or OpenLLM), and configure GPU resources. The system then automates the download, execution, and result aggregation, producing a summary that can be uploaded to GitHub Gists for easy sharing and comparison.
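
The notebook essentially reduces an evaluation run to a handful of parameters. Below is a minimal sketch of that configuration, using illustrative variable names and values rather than the notebook's exact identifiers.

```python
# Illustrative configuration for one evaluation run.
# Names and values are assumptions for this sketch, not the notebook's exact variables.
MODEL_ID = "mlabonne/NeuralBeagle14-7B"   # Hugging Face model to evaluate
BENCHMARK = "nous"                        # benchmark suite: "nous", "lighteval", or "openllm"
GPU_TYPE = "NVIDIA GeForce RTX 3090"      # RunPod GPU type
GPU_COUNT = 1                             # number of GPUs to attach to the pod
CONTAINER_DISK_GB = 100                   # disk size for model weights and results

run_config = {
    "model_id": MODEL_ID,
    "benchmark": BENCHMARK,
    "gpu_type": GPU_TYPE,
    "gpu_count": GPU_COUNT,
    "container_disk_gb": CONTAINER_DISK_GB,
}
```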

Quick Start & Requirements

  • Installation: Primarily run within a Google Colab notebook.
  • Prerequisites:
    • RunPod account and API token (read & write permissions).
    • GitHub account and Personal Access Token (gist scope).
    • Optional: Hugging Face token.
    • Recommended: high-end GPUs (RTX 3090 or better) for the Open LLM benchmark suite.
  • Setup: Requires configuring secrets in Colab's Secrets tab for the RunPod and GitHub tokens (see the sketch after this list).
  • Documentation: LLM AutoEval README
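
For the secrets step, tokens stored in Colab's Secrets tab can be read with google.colab.userdata. A minimal sketch, assuming placeholder secret names (use whatever names the notebook actually expects):

```python
# Minimal sketch: read tokens from Colab's Secrets tab.
# The secret names below are placeholders, not necessarily the notebook's.
from google.colab import userdata

runpod_token = userdata.get("RUNPOD_TOKEN")   # RunPod API key (read & write permissions)
github_token = userdata.get("GITHUB_TOKEN")   # GitHub PAT with the "gist" scope
hf_token = userdata.get("HF_TOKEN")           # optional: Hugging Face token for gated models
```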

Highlighted Details

  • Supports multiple benchmark suites: Nous (AGIEval, GPT4ALL, TruthfulQA, Bigbench), Lighteval (HELM, PIQA, GSM8K, MATH), and OpenLLM (ARC, HellaSwag, MMLU, Winogrande, GSM8K, TruthfulQA).
  • Integrates with vLLM for accelerated inference in the OpenLLM benchmark suite.
  • Automated summary generation and upload to GitHub Gists for easy result sharing and leaderboard creation (e.g., the YALL Leaderboard); a minimal upload sketch follows this list.
  • Customizable evaluation parameters and GPU configurations (type, number, disk size).
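
To give a rough idea of what the Gist upload involves, here is a hedged sketch using the public GitHub REST API; LLM AutoEval automates an equivalent step internally, and the function name, file name, and token variable below are illustrative.

```python
# Hedged sketch: create a public GitHub Gist holding an evaluation summary.
# Uses the documented GitHub REST API (POST /gists); the helper name is made up.
import requests

def upload_gist(token: str, filename: str, content: str, description: str) -> str:
    """Create a public gist and return its URL."""
    resp = requests.post(
        "https://api.github.com/gists",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "description": description,
            "public": True,
            "files": {filename: {"content": content}},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]

# Example (token from Colab secrets, summary produced by the evaluation run):
# url = upload_gist(github_token, "summary.md", summary_markdown, "LLM AutoEval summary")
```

A gist created this way can then be linked from a leaderboard such as YALL for side-by-side comparison.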

Maintenance & Community

  • The project is in its early stages and primarily intended for personal use; contributions are welcome.
  • Acknowledgements mention integrations with lighteval (Hugging Face), lm-evaluation-harness (EleutherAI), and vllm.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README.

Limitations & Caveats

  • The project is in its early stages and primarily intended for personal use.
  • Specific benchmark tasks can have issues (e.g., the "mmlu" task is missing from the Open LLM suite due to a vLLM issue).
  • Hardware limitations can lead to "700 Killed" errors, particularly with demanding benchmarks like Open LLM.
  • Requires specific CUDA driver versions; outdated drivers necessitate starting a new pod.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 37 stars in the last 90 days
