This repository provides a comprehensive collection of benchmarks and datasets for evaluating Large Language Models (LLMs), targeting AI researchers and developers. It aims to standardize LLM evaluation across various capabilities, from knowledge understanding to coding, offering a structured approach to assessing model performance.
How It Works
The project curates and organizes established benchmarks, categorizing them by LLM capability (e.g., Reasoning, Question Answering, Coding). Each benchmark includes a description, purpose, relevance, and links to resources like datasets and GitHub repositories. This structured approach allows users to select appropriate evaluations for their specific LLM development or research needs.
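The repository itself is prose plus links rather than code, but as a rough illustration of the structure each entry carries, one could model an entry as a small record. The field names and the sample values below are assumptions for illustration, not a schema the project provides:

```python
# Hypothetical sketch of one benchmark entry in the collection; the field names
# are illustrative assumptions, not an API or schema defined by the repository.
from dataclasses import dataclass

@dataclass
class BenchmarkEntry:
    name: str          # e.g. "MT-bench"
    capability: str    # e.g. "Reasoning", "Question Answering", "Coding"
    description: str   # what the benchmark measures
    purpose: str       # why it matters for LLM evaluation
    relevance: str     # when to reach for it
    dataset_url: str   # HuggingFace or direct download link
    repo_url: str      # GitHub repository with evaluation code

# Example instance (values paraphrased for illustration only)
mt_bench = BenchmarkEntry(
    name="MT-bench",
    capability="Multi-turn conversation",
    description="Multi-turn questions graded with LLM-assisted evaluation",
    purpose="Assess instruction-following across conversation turns",
    relevance="Chat-oriented model comparisons",
    dataset_url="https://huggingface.co/datasets/...",  # placeholder
    repo_url="https://github.com/lm-sys/FastChat",
)
```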
Quick Start & Requirements
- Datasets are primarily available via HuggingFace or direct GitHub links.
- No specific installation command is provided; users are expected to download and integrate datasets as needed.
- Requirements vary per dataset, often including Python and libraries such as HuggingFace's `datasets` (see the loading sketch below).
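
A minimal sketch of pulling a Hub-hosted benchmark with the `datasets` library. The dataset ID below (`openai_humaneval`, a coding benchmark) is used purely as an example; substitute the ID linked from whichever benchmark entry you want to evaluate against:

```python
# Load one benchmark from the HuggingFace Hub. "openai_humaneval" is an example
# coding benchmark ID; swap in the dataset linked from the entry you need.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")

# Each record pairs a function signature + docstring with unit tests.
print(humaneval[0]["prompt"])
```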
Highlighted Details
- Covers a wide spectrum of LLM evaluation tasks: Knowledge, Reasoning, QA, Summarization, Coding, and Responsible AI.
- Includes benchmarks for multi-turn conversations (MT-bench) and LLM-assisted evaluation (LLM Judge, JudgeLM, Prometheus).
- Features datasets for specialized areas like medical notes (ACI-BENCH) and implicit hate speech detection (ToxiGen).
- Provides insights into LLM evaluation methodologies, favoring reproducible leaderboards over arenas or LLM-as-a-judge.
Maintenance & Community
- The project appears to be a curated collection rather than an actively developed software project.
- Mentions of "Latent Space - Benchmarks 201" and "OpenLLM Leaderboard" suggest community involvement in LLM evaluation standards.
- Links to various GitHub repositories and HuggingFace datasets indicate community contributions.
Licensing & Compatibility
- Licenses vary per dataset and benchmark, as indicated by links to their respective sources. Users must consult individual dataset licenses for compatibility and usage restrictions.
Limitations & Caveats
- The repository itself is a collection, not a unified evaluation framework; users must manage individual dataset integrations.
- The README cautions against using LLMs as judges because of known biases; if LLM-as-a-judge is unavoidable, it recommends open models such as Prometheus or JudgeLM so that results remain reproducible (see the sketch below).
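
A hedged sketch of running an open judge model deterministically so scores can be reproduced. The model ID (`prometheus-eval/prometheus-7b-v2.0`) and the rubric-style prompt are assumptions for illustration; consult the Prometheus and JudgeLM repositories linked in the collection for their exact prompt formats and scoring conventions:

```python
# Illustrative only: greedy decoding with an open judge model keeps the verdict
# reproducible across runs, unlike sampled closed-model judging.
from transformers import pipeline

judge = pipeline("text-generation", model="prometheus-eval/prometheus-7b-v2.0")

prompt = (
    "###Task: Rate the response from 1 to 5 against the rubric.\n"
    "###Question: <benchmark question>\n"
    "###Response: <model output to be judged>\n"
    "###Rubric: <scoring criteria>\n"
    "###Feedback:"
)

# do_sample=False forces greedy decoding, so repeated runs give the same score.
verdict = judge(prompt, max_new_tokens=256, do_sample=False)
print(verdict[0]["generated_text"])
```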