llm_benchmark by malody2014

LLM evaluation benchmark tracking model evolution

Created 5 months ago · 318 stars · Top 86.3% on sourcepulse

Project Summary

This repository provides a personal, long-term evaluation of large language models (LLMs), focusing on logic, mathematics, programming, and human intuition. It is aimed at LLM developers and researchers who want to track how models evolve over time, using a private, rolling set of 30 questions (240 test cases) that is updated monthly and never publicly disclosed.

How It Works

Each question is worth 10 points, awarded in proportion to the number of scoring points answered correctly out of the total available for that question. Emphasis is placed on the reasoning process: answers that appear to be guessed rather than derived earn no points. Output format and constraints (e.g., no explanations when none are requested) are enforced strictly, and a violation scores zero for that question. Models are tested through official APIs or OpenRouter with fixed temperature and token-limit settings; each question is run three times and the highest of the three scores is recorded.
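As a rough illustration only, here is a minimal sketch of the scoring rule described above; the function names, parameters, and example values are hypothetical and not taken from the repository:

```python
# Hypothetical sketch of the per-question scoring rule described above.
# Function and parameter names are illustrative, not from the repository.

def score_question(correct_points: int, total_points: int,
                   format_violation: bool) -> float:
    """Each question is worth 10 points, scaled by the fraction of
    scoring points answered correctly; a format/constraint violation
    zeroes the question."""
    if format_violation:
        return 0.0
    return 10.0 * correct_points / total_points

def best_of_runs(run_scores: list[float]) -> float:
    """The highest score across the three runs is kept."""
    return max(run_scores)

# Example: 3 of 4 scoring points correct, no violation -> 7.5 points
print(score_question(3, 4, format_violation=False))   # 7.5
print(best_of_runs([6.0, 7.5, 5.0]))                   # 7.5
```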

Quick Start & Requirements

This project is not a runnable tool but a benchmark dataset and methodology. Access to LLM APIs (e.g., OpenAI, OpenRouter) is required for testing.
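For instance, a model hosted on OpenRouter can be queried through its OpenAI-compatible endpoint. The sketch below is an assumption about a typical setup, not a script shipped with the repository; the model slug, prompt, temperature, and token limit are placeholders rather than the benchmark's actual settings:

```python
# Hypothetical example of querying a model via OpenRouter's
# OpenAI-compatible API; all settings shown are placeholders.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openai/gpt-4o",  # placeholder model slug
    messages=[{"role": "user", "content": "Answer with a single number: 2+2"}],
    temperature=0.0,        # the benchmark fixes sampling settings per model
    max_tokens=256,         # and applies a token limit
)
print(resp.choices[0].message.content)
```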

Highlighted Details

  • Private, Rolling Dataset: Utilizes a unique, non-public dataset that changes monthly to prevent overfitting and ensure continuous evaluation.
  • Diverse Question Types: Covers a wide range of tasks including text summarization, algorithmic reasoning, pattern recognition, game simulation, and code interpretation.
  • Detailed Performance Metrics: Tracks original and median scores, cost per million tokens, average response length, and inference time for each model (a sketch of such a record follows this list).
  • Focus on Reasoning: Prioritizes correct derivation and logical steps over mere correct answers.
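To make the tracked metrics concrete, the following is a hedged sketch of what a per-model result record could look like; the field names and values are illustrative, not the repository's actual schema:

```python
# Illustrative per-model metrics record; field names and values are
# hypothetical, not the repository's actual schema.
from dataclasses import dataclass

@dataclass
class ModelResult:
    model: str                 # model identifier as tested
    original_score: float      # score on the current month's question set
    median_score: float        # median score across archived monthly runs
    cost_per_mtok_usd: float   # cost per million tokens, in USD
    avg_response_chars: int    # average response length
    avg_inference_s: float     # average inference time per question, seconds

example = ModelResult(
    model="example/model-name",
    original_score=210.5,
    median_score=205.0,
    cost_per_mtok_usd=3.0,
    avg_response_chars=1800,
    avg_inference_s=12.4,
)
print(example)
```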

Maintenance & Community

The project is personal in nature, with updates and detailed evaluations often shared on the author's Zhihu and WeChat official accounts. The leaderboard is updated in real time after each new model is tested, and scores are archived monthly.

Licensing & Compatibility

The repository itself does not specify a license. The evaluation methodology and data are intended for personal observation and understanding of LLM trends.

Limitations & Caveats

The author explicitly states that the evaluation is neither authoritative nor comprehensive and reflects only one facet of LLM capability. Because the dataset rotates monthly, score fluctuations of roughly ±4 points between months are expected. The private nature of the test set also means results cannot be independently verified.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 227 stars in the last 90 days
