LLM evaluation benchmark tracking model evolution
This repository provides a personal, long-term evaluation of large language models (LLMs) focused on logic, mathematics, programming, and human intuition. It targets LLM developers and researchers who want to track how models evolve over time, using a private, rolling dataset of 30 questions (240 test cases) that is refreshed monthly and never publicly disclosed.
How It Works
Each question is worth 10 points, awarded in proportion to the number of scoring points answered correctly out of the total available for that question. The reasoning process matters: guessed answers earn no credit. Output format and constraints (e.g., no explanations when none are requested) are strictly enforced, and a violation zeroes the question. Models are tested via their official APIs or OpenRouter with fixed temperature and token limits; each model is run three times and the highest score is kept.
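As a concrete illustration, the sketch below encodes this scoring rule in Python. The function names and data layout are assumptions for illustration, not code from the repository.

```python
# Hypothetical sketch of the per-question scoring rule described above.

def score_question(correct_points: int, total_points: int,
                   format_violation: bool = False) -> float:
    """Each question is worth 10 points, scaled by the fraction of scoring
    points answered correctly; a format/constraint violation zeroes it."""
    if format_violation:
        return 0.0
    if total_points <= 0:
        raise ValueError("total_points must be positive")
    return 10.0 * correct_points / total_points


def best_of_runs(run_totals: list[float]) -> float:
    """Keep the highest total across repeated runs (three in the methodology)."""
    return max(run_totals)


# Example: 3 of 4 scoring points correct -> 7.5 out of 10 for that question.
print(score_question(3, 4))              # 7.5
print(best_of_runs([71.0, 74.5, 73.0]))  # 74.5
```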
Quick Start & Requirements
This project is not a runnable tool but a benchmark dataset and methodology. Access to LLM APIs (e.g., OpenAI, OpenRouter) is required for testing.
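For reference, the sketch below shows one way a single benchmark question could be sent through OpenRouter's OpenAI-compatible chat endpoint. The model slug, prompt, and sampling settings are placeholders, not the repository's actual configuration.

```python
# Minimal sketch of querying a model via OpenRouter (assumes the
# OPENROUTER_API_KEY environment variable is set).
import os
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def ask_model(model: str, question: str, temperature: float = 0.0,
              max_tokens: int = 1024) -> str:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": question}],
            "temperature": temperature,  # fixed sampling settings per run
            "max_tokens": max_tokens,    # token limit applied per run
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Example with a hypothetical model slug and question:
# answer = ask_model("openai/gpt-4o", "Compute 17 * 23. Output only the number.")
```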
Maintenance & Community
The project is personal in nature, with updates and detailed write-ups shared on the author's Zhihu column and WeChat official account. The leaderboard is updated in real time after each new model is tested, and scores are archived monthly.
Licensing & Compatibility
The repository itself does not specify a license. The evaluation methodology and data are intended for personal observation and understanding of LLM trends.
Limitations & Caveats
The evaluation is explicitly described as neither authoritative nor comprehensive, reflecting only one facet of LLM capability. Monthly score fluctuations of roughly ±4 points are expected as the dataset is refreshed. Because the test set is private, results cannot be independently verified.