Benchmark for LLM learning capability and efficiency
EvaLearn is a benchmark designed to evaluate the learning capability and efficiency of Large Language Models (LLMs) through sequential problem-solving. It targets researchers and developers seeking to quantify how well LLMs adapt and improve over a series of related tasks, offering a more dynamic assessment than static benchmarks.
How It Works
EvaLearn structures 648 problems into 182 sequences, each focusing on a specific task type. The core innovation is its sequential evaluation approach, where LLMs must solve problems in order, leveraging knowledge gained from prior solutions. This mimics real-world learning scenarios and allows for metrics like learning speed and post-warmup performance.
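To make the sequence-level metrics concrete, here is a minimal sketch of a sequential evaluation loop. It assumes a hypothetical solve(problem, history) callable and simple definitions of post-warmup accuracy and learning speed; these are illustrative choices, not EvaLearn's official formulas.

```python
# Illustrative sketch of sequence-level learning metrics (hypothetical,
# not EvaLearn's implementation). Each sequence is a list of problems and
# solve(problem, history) is assumed to return (answer, correct).
from statistics import mean

def evaluate_sequence(problems, solve, warmup=2):
    """Solve problems in order, carrying earlier attempts forward as context."""
    history, correct_flags = [], []
    for problem in problems:
        answer, correct = solve(problem, history)   # model sees prior attempts
        history.append((problem, answer))
        correct_flags.append(1.0 if correct else 0.0)

    overall_acc = mean(correct_flags)
    # Post-warmup accuracy: performance after the first few problems,
    # a rough proxy for how much the model benefits from in-sequence learning.
    post_warmup_acc = mean(correct_flags[warmup:]) if len(correct_flags) > warmup else None
    # Learning speed: least-squares slope of correctness against position.
    n = len(correct_flags)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(correct_flags)
    denom = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, correct_flags)) / denom if denom else 0.0
    return {"accuracy": overall_acc, "post_warmup_accuracy": post_warmup_acc, "learning_speed": slope}
```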
Quick Start & Requirements
git clone the repository, cd into it, and run pip install -r requirements.txt. Then run python EvaLearn/Evaluate/evaluate.py with the input, sequence, and output paths specified, along with the required API keys. Library usage is also supported via direct import.
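A hypothetical invocation sketch is shown below; the flag names (--input_path, --sequence_path, --output_path, --api_key) and file paths are assumptions for illustration, so check the repository or the script's help output for the actual argument names.

```python
# Hypothetical invocation sketch; the flags are assumed, not taken from
# evaluate.py itself -- consult the repository for the real interface.
import subprocess

subprocess.run(
    [
        "python", "EvaLearn/Evaluate/evaluate.py",
        "--input_path", "data/problems.jsonl",      # assumed: problem file
        "--sequence_path", "data/sequences.jsonl",  # assumed: sequence definitions
        "--output_path", "results/output.jsonl",    # assumed: where results are written
        "--api_key", "YOUR_API_KEY",                # assumed: key for the judge/model API
    ],
    check=True,
)
```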
Highlighted Details
The repository ships an evaluation script (evaluate.py) and structured datasets for problems and sequences.
Maintenance & Community
The project is associated with ByteDance and academic institutions, including Fudan University. Contact information for Shihan Dou and Ming Zhang is provided.
Licensing & Compatibility
Limitations & Caveats
Canonical answers for problems are not open-sourced due to intellectual property concerns, meaning the evaluation relies on LLM-based judging. The benchmark is described as "pioneering," suggesting it may be in early stages of adoption and refinement.
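Because canonical answers are withheld, scoring flows through a judge model. The snippet below is a generic LLM-as-judge sketch using the OpenAI Python client; the prompt wording, model name, and grading convention are assumptions, not EvaLearn's actual judging pipeline.

```python
# Generic LLM-as-judge sketch (assumed setup, not EvaLearn's own judge code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(problem: str, response: str) -> bool:
    """Ask a judge model whether the response solves the problem."""
    prompt = (
        "You are grading a model's answer.\n"
        f"Problem:\n{problem}\n\nAnswer:\n{response}\n\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; substitute your own
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().upper().startswith("CORRECT")
```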