llm-leaderboard by LudwigStumpp

LLM leaderboard for model evaluation

created 2 years ago
304 stars

Top 88.9% on sourcepulse

Project Summary

This repository provides a community-driven leaderboard for evaluating Large Language Models (LLMs). It centralizes performance metrics from multiple benchmarks in one table, giving researchers and developers a single place to compare LLM capabilities. The project defines "open" models as those that can be deployed locally and used for commercial purposes.

How It Works

The leaderboard aggregates published performance data for numerous LLMs across standardized benchmarks such as Chatbot Arena Elo, HellaSwag, HumanEval, LAMBADA, MMLU, TriviaQA, and WinoGrande. Each entry lists the model name, publisher, an "Open?" status, and a score per benchmark, with a direct link to the source of each number, making the comparison transparent and verifiable.
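To make the entry structure concrete, here is a minimal sketch of one leaderboard row as a Python dataclass. The class and field names (LeaderboardEntry, open_for_commercial_use, and so on) and the sample scores are illustrative assumptions, not definitions from the repository's code.

```python
from dataclasses import dataclass, field

@dataclass
class LeaderboardEntry:
    """One row of the leaderboard, as described above (hypothetical shape)."""
    model_name: str
    publisher: str
    open_for_commercial_use: bool  # the "Open?" column
    scores: dict[str, float] = field(default_factory=dict)  # benchmark -> score
    source_links: dict[str, str] = field(default_factory=dict)  # benchmark -> citation URL

# Made-up example row, for illustration only.
entry = LeaderboardEntry(
    model_name="example-model",
    publisher="Example Lab",
    open_for_commercial_use=True,
    scores={"MMLU": 0.70, "HellaSwag": 0.85},
)
print(entry.model_name, entry.scores.get("MMLU"))
```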

Quick Start & Requirements

The project primarily serves as a data repository and reference. The interactive dashboard can be accessed via Streamlit at https://llm-leaderboard.streamlit.app/ or Hugging Face Spaces at https://huggingface.co/spaces/ludwigstumpp/llm-leaderboard. No direct installation or code execution is required to view the leaderboard.
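For readers who want the data rather than the dashboard, one possible approach is to pull the leaderboard table straight from the repository's README and parse its markdown rows. This is a sketch under two assumptions: that the table lives in the README as a GitHub-flavored markdown table, and that the raw URL below is correct; neither is confirmed by the summary above.

```python
import urllib.request

# Assumed location of the raw README; verify before relying on it.
RAW_README = "https://raw.githubusercontent.com/LudwigStumpp/llm-leaderboard/main/README.md"

with urllib.request.urlopen(RAW_README) as resp:
    text = resp.read().decode("utf-8")

# Keep lines that look like markdown table rows ("| a | b |"),
# skipping the |---|---| separator rows.
rows = [
    [cell.strip() for cell in line.strip().strip("|").split("|")]
    for line in text.splitlines()
    if line.lstrip().startswith("|") and "---" not in line
]

for row in rows[:5]:  # header plus the first few entries
    print(row)
```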

Highlighted Details

  • Comprehensive comparison across 7+ key LLM benchmarks.
  • Clear distinction between "open" (commercially usable) and closed models.
  • Interactive dashboards for easy exploration and comparison.
  • Community-driven contributions and corrections are encouraged.

Maintenance & Community

The project is a joint community effort, welcoming contributions for table work (filling missing entries, adding models/benchmarks) and code work. Future ideas include adding model year and detailed technical specifications.

Licensing & Compatibility

The provided README does not state a license for the repository itself. The "Open?" column flags models that can be deployed locally and used commercially, but it is only a pointer: users should consult each model's own license before any commercial use.

Limitations & Caveats

The README explicitly warns that the "above information may be wrong" and advises consulting a lawyer before using any of the listed models commercially. Because the data is collected by hand from individual papers and published results, entries may be outdated or inaccurate.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 8 stars in the last 90 days

Explore Similar Projects

helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

created 3 years ago, updated 1 day ago
2k stars
Top 0.9% on sourcepulse

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Luca Antiga (CTO of Lightning AI), and 4 more.