LLM leaderboard for model evaluation
This repository provides a community-driven leaderboard for evaluating Large Language Models (LLMs). It aims to centralize performance metrics across various benchmarks, offering a valuable resource for researchers and developers seeking to compare and understand LLM capabilities. The project defines "open" models as those deployable locally and usable for commercial purposes.
How It Works
The leaderboard aggregates performance data for numerous LLMs across standardized benchmarks: Chatbot Arena Elo, HellaSwag, HumanEval, LAMBADA, MMLU, TriviaQA, and WinoGrande. Each entry lists the model name, publisher, an "Open?" status, and a score for each benchmark, with a direct link to the source of each figure. Because every number can be traced to its source, the comparison is transparent and verifiable.
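The entry layout described above can be sketched as a small data structure. This is a minimal illustration, not the repository's actual code: the class names, the `rank_by` helper, the model names, the scores, and the example.com links are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Score:
    value: float
    source_url: str  # where the figure was reported (placeholder URLs below)


@dataclass
class Entry:
    model: str
    publisher: str
    is_open: bool  # the "Open?" column: locally deployable and commercially usable
    scores: Dict[str, Score] = field(default_factory=dict)  # keyed by benchmark name


def rank_by(entries: List[Entry], benchmark: str, open_only: bool = False) -> List[Entry]:
    """Return entries that report `benchmark`, best score first.

    Entries without a score for the benchmark are skipped, not treated as zero.
    """
    candidates = [
        e for e in entries
        if benchmark in e.scores and (e.is_open or not open_only)
    ]
    return sorted(candidates, key=lambda e: e.scores[benchmark].value, reverse=True)


# Hypothetical rows; names and numbers are made up for illustration only.
entries = [
    Entry("model-a", "example-lab", True, {
        "MMLU": Score(50.0, "https://example.com/model-a"),
        "HellaSwag": Score(80.0, "https://example.com/model-a"),
    }),
    Entry("model-b", "example-corp", False, {
        "MMLU": Score(70.0, "https://example.com/model-b"),
    }),
    Entry("model-c", "example-lab", True, {
        "HellaSwag": Score(75.0, "https://example.com/model-c"),
    }),
]

for e in rank_by(entries, "MMLU", open_only=True):
    print(e.model, e.scores["MMLU"].value, e.scores["MMLU"].source_url)
```

Keeping the source link next to each score, rather than once per model, mirrors the per-benchmark links the leaderboard provides and lets a missing benchmark stay simply absent.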
Quick Start & Requirements
The project primarily serves as a data repository and reference. The interactive dashboard can be accessed via Streamlit at https://llm-leaderboard.streamlit.app/ or Hugging Face Spaces at https://huggingface.co/spaces/ludwigstumpp/llm-leaderboard. No direct installation or code execution is required to view the leaderboard.
Maintenance & Community
The project is a joint community effort, welcoming contributions for table work (filling missing entries, adding models/benchmarks) and code work. Future ideas include adding model year and detailed technical specifications.
Licensing & Compatibility
The provided README does not state a license for the repository itself. The "Open?" column marks models that can be deployed locally and used for commercial purposes, which suggests a focus on openly usable models. Users should still consult each model's own license before commercial use.
Limitations & Caveats
The README explicitly states that "Above information may be wrong" and advises consulting a lawyer before using published models commercially. The data is collected from individual papers and published results, so entries may be outdated or inaccurate.
11 months ago
Inactive