llm-leaderboard by LudwigStumpp

LLM leaderboard for model evaluation

created 2 years ago
304 stars

Top 88.9% on sourcepulse

Project Summary

This repository provides a community-driven leaderboard for evaluating Large Language Models (LLMs). It centralizes performance metrics from multiple benchmarks in one table, giving researchers and developers a single place to compare LLM capabilities. The project defines "open" models as those that can be deployed locally and used for commercial purposes.

How It Works

The leaderboard aggregates published performance data for numerous LLMs across standardized benchmarks such as Chatbot Arena Elo, HellaSwag, HumanEval, LAMBADA, MMLU, TriviaQA, and WinoGrande. Each entry lists the model name, publisher, an "Open?" status, and a score per benchmark, with a direct link to the source of each number, making the comparison transparent and verifiable.
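To make the entry structure concrete, here is a minimal sketch of one leaderboard row as a Python dataclass. The class and field names (LeaderboardEntry, open_for_commercial_use, and so on) and the sample scores are illustrative assumptions, not definitions from the repository's code.

```python
from dataclasses import dataclass, field

@dataclass
class LeaderboardEntry:
    """One row of the leaderboard, as described above (hypothetical shape)."""
    model_name: str
    publisher: str
    open_for_commercial_use: bool  # the "Open?" column
    scores: dict[str, float] = field(default_factory=dict)  # benchmark -> score
    source_links: dict[str, str] = field(default_factory=dict)  # benchmark -> citation URL

# Made-up example row, for illustration only.
entry = LeaderboardEntry(
    model_name="example-model",
    publisher="Example Lab",
    open_for_commercial_use=True,
    scores={"MMLU": 0.70, "HellaSwag": 0.85},
)
print(entry.model_name, entry.scores.get("MMLU"))
```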

Quick Start & Requirements

The project primarily serves as a data repository and reference. The interactive dashboard can be accessed via Streamlit at https://llm-leaderboard.streamlit.app/ or Hugging Face Spaces at https://huggingface.co/spaces/ludwigstumpp/llm-leaderboard. No direct installation or code execution is required to view the leaderboard.
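For readers who want the data rather than the dashboard, one possible approach is to pull the leaderboard table straight from the repository's README and parse its markdown rows. This is a sketch under two assumptions: that the table lives in the README as a GitHub-flavored markdown table, and that the raw URL below is correct; neither is confirmed by the summary above.

```python
import urllib.request

# Assumed location of the raw README; verify before relying on it.
RAW_README = "https://raw.githubusercontent.com/LudwigStumpp/llm-leaderboard/main/README.md"

with urllib.request.urlopen(RAW_README) as resp:
    text = resp.read().decode("utf-8")

# Keep lines that look like markdown table rows ("| a | b |"),
# skipping the |---|---| separator rows.
rows = [
    [cell.strip() for cell in line.strip().strip("|").split("|")]
    for line in text.splitlines()
    if line.lstrip().startswith("|") and "---" not in line
]

for row in rows[:5]:  # header plus the first few entries
    print(row)
```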

Highlighted Details

  • Comprehensive comparison across 7+ key LLM benchmarks.
  • Clear distinction between "open" (commercially usable) and closed models.
  • Interactive dashboards for easy exploration and comparison.
  • Community-driven contributions and corrections are encouraged.

Maintenance & Community

The project is a joint community effort, welcoming contributions for table work (filling missing entries, adding models/benchmarks) and code work. Future ideas include adding model year and detailed technical specifications.

Licensing & Compatibility

The provided README does not state a license for the repository itself. The "Open?" column flags models that can be deployed locally and used commercially, but it is only a pointer: users should consult each model's own license before any commercial use.

Limitations & Caveats

The README explicitly warns that the "above information may be wrong" and advises consulting a lawyer before using any of the listed models commercially. Because the data is collected by hand from individual papers and published results, entries may be outdated or inaccurate.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 8 stars in the last 90 days

Explore Similar Projects

helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

created 3 years ago, updated 1 day ago
2k stars
Top 0.9% on sourcepulse

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Luca Antiga (CTO of Lightning AI), and 4 more.