llm_benchmark by malody2014

LLM evaluation benchmark tracking model evolution

Created 5 months ago · 318 stars · Top 86.3% on sourcepulse

Project Summary

This repository provides a personal, long-term evaluation of large language models (LLMs), focusing on logic, mathematics, programming, and human intuition. It is aimed at LLM developers and researchers who want to track how models evolve over time, using a private, rolling set of 30 questions (240 test cases) that is updated monthly and never publicly disclosed.

How It Works

Each question is worth 10 points, awarded in proportion to the number of scoring points answered correctly out of the total available for that question. Emphasis is placed on the reasoning process: answers that appear to be guessed rather than derived earn no points. Output format and constraints (e.g., no explanations when none are requested) are enforced strictly, and a violation scores zero for that question. Models are tested through official APIs or OpenRouter with fixed temperature and token-limit settings; each question is run three times and the highest of the three scores is recorded.
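As a rough illustration only, here is a minimal sketch of the scoring rule described above; the function names, parameters, and example values are hypothetical and not taken from the repository:

```python
# Hypothetical sketch of the per-question scoring rule described above.
# Function and parameter names are illustrative, not from the repository.

def score_question(correct_points: int, total_points: int,
                   format_violation: bool) -> float:
    """Each question is worth 10 points, scaled by the fraction of
    scoring points answered correctly; a format/constraint violation
    zeroes the question."""
    if format_violation:
        return 0.0
    return 10.0 * correct_points / total_points

def best_of_runs(run_scores: list[float]) -> float:
    """The highest score across the three runs is kept."""
    return max(run_scores)

# Example: 3 of 4 scoring points correct, no violation -> 7.5 points
print(score_question(3, 4, format_violation=False))   # 7.5
print(best_of_runs([6.0, 7.5, 5.0]))                   # 7.5
```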

Quick Start & Requirements

This project is not a runnable tool but a benchmark dataset and methodology. Access to LLM APIs (e.g., OpenAI, OpenRouter) is required for testing.
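For instance, a model hosted on OpenRouter can be queried through its OpenAI-compatible endpoint. The sketch below is an assumption about a typical setup, not a script shipped with the repository; the model slug, prompt, temperature, and token limit are placeholders rather than the benchmark's actual settings:

```python
# Hypothetical example of querying a model via OpenRouter's
# OpenAI-compatible API; all settings shown are placeholders.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openai/gpt-4o",  # placeholder model slug
    messages=[{"role": "user", "content": "Answer with a single number: 2+2"}],
    temperature=0.0,        # the benchmark fixes sampling settings per model
    max_tokens=256,         # and applies a token limit
)
print(resp.choices[0].message.content)
```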

Highlighted Details

  • Private, Rolling Dataset: Utilizes a unique, non-public dataset that changes monthly to prevent overfitting and ensure continuous evaluation.
  • Diverse Question Types: Covers a wide range of tasks including text summarization, algorithmic reasoning, pattern recognition, game simulation, and code interpretation.
  • Detailed Performance Metrics: Tracks original and median scores, cost per million tokens, average response length, and inference time for each model (a sketch of such a record follows this list).
  • Focus on Reasoning: Prioritizes correct derivation and logical steps over mere correct answers.
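To make the tracked metrics concrete, the following is a hedged sketch of what a per-model result record could look like; the field names and values are illustrative, not the repository's actual schema:

```python
# Illustrative per-model metrics record; field names and values are
# hypothetical, not the repository's actual schema.
from dataclasses import dataclass

@dataclass
class ModelResult:
    model: str                 # model identifier as tested
    original_score: float      # score on the current month's question set
    median_score: float        # median score across archived monthly runs
    cost_per_mtok_usd: float   # cost per million tokens, in USD
    avg_response_chars: int    # average response length
    avg_inference_s: float     # average inference time per question, seconds

example = ModelResult(
    model="example/model-name",
    original_score=210.5,
    median_score=205.0,
    cost_per_mtok_usd=3.0,
    avg_response_chars=1800,
    avg_inference_s=12.4,
)
print(example)
```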

Maintenance & Community

The project is personal in nature, with updates and detailed evaluations often shared on the author's Zhihu and WeChat official accounts. The leaderboard is updated in real time after each new model is tested, and scores are archived monthly.

Licensing & Compatibility

The repository itself does not specify a license. The evaluation methodology and data are intended for personal observation and understanding of LLM trends.

Limitations & Caveats

The author explicitly states that the evaluation is neither authoritative nor comprehensive and reflects only one facet of LLM capability. Because the dataset rotates monthly, score fluctuations of roughly ±4 points between months are expected. The private nature of the test set also means results cannot be independently verified.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 227 stars in the last 90 days
