EvaLearn by ByteDance-Seed

Benchmark for LLM learning capability and efficiency

created 2 months ago
422 stars

Top 70.8% on sourcepulse

Project Summary

EvaLearn is a benchmark designed to evaluate the learning capability and efficiency of Large Language Models (LLMs) through sequential problem-solving. It targets researchers and developers seeking to quantify how well LLMs adapt and improve over a series of related tasks, offering a more dynamic assessment than static benchmarks.

How It Works

EvaLearn structures 648 problems into 182 sequences, each focusing on a specific task type. The core innovation is its sequential evaluation approach, where LLMs must solve problems in order, leveraging knowledge gained from prior solutions. This mimics real-world learning scenarios and allows for metrics like learning speed and post-warmup performance.
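
Conceptually, the protocol looks like the following minimal Python sketch. The ask_model and judge callables are placeholders for the model under test and the LLM judge; they are illustrative assumptions, not EvaLearn's actual API.

    # Minimal sketch of the sequential protocol; ask_model and judge are placeholders.
    def evaluate_sequence(problems, ask_model, judge):
        history, results = [], []
        for position, problem in enumerate(problems, start=1):
            # The model sees earlier problems and its own earlier answers, so later
            # items in the sequence can benefit from what it "learned" so far.
            answer = ask_model(context=history, question=problem["question"])
            correct = judge(problem, answer)  # LLM-based judging (canonical answers are not public)
            history.append({"question": problem["question"], "answer": answer})
            results.append({"position": position, "correct": correct})
        return results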

Quick Start & Requirements

  • Installation: git clone the repository, cd into it, and run pip install -r requirements.txt.
  • Prerequisites: Python 3.7+ and API keys for OpenAI and any other LLM providers being evaluated.
  • Usage: Run the evaluation via python EvaLearn/Evaluate/evaluate.py with the input, sequence, and output paths and the relevant API keys; library usage is also supported via direct import (see the sketch after this list).
  • Resources: Requires API access to LLMs for evaluation.
  • Links: Paper
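
A minimal sketch of the library-style usage mentioned above. The module path is inferred from the script location, but the evaluate entry point and its keyword arguments are assumptions for illustration; consult the repository for the actual interface.

    # Hypothetical library usage; the function name and parameters below are
    # assumptions, not EvaLearn's confirmed API.
    import os
    from EvaLearn.Evaluate.evaluate import evaluate  # module path assumed from the script location

    results = evaluate(
        input_path="data/problems.jsonl",      # problem data (path and format assumed)
        sequence_path="data/sequences.json",   # sequence definitions (assumed)
        output_path="results/run1.jsonl",      # where per-problem judgments are written
        api_key=os.environ["OPENAI_API_KEY"],  # key for the LLM judge
    )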

Highlighted Details

  • Evaluates LLMs on sequential problem-solving, measuring learning capability and efficiency.
  • Employs a sequential evaluation tool (evaluate.py) and structured datasets for problems and sequences.
  • Offers metrics including overall sequence accuracy, accuracy slope (learning speed), average position of the first correct solution, and post-warmup accuracy (see the sketch after this list).
  • Supports custom sequence selection by type or ID.
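
The sequence-level metrics above can be computed from per-position correctness roughly as in this sketch; these are plausible formulations for illustration, not necessarily EvaLearn's exact definitions.

    import numpy as np

    # Plausible formulations of the four metrics, computed over one sequence;
    # `correct` is a list of booleans, one per problem position.
    def sequence_metrics(correct, warmup=3):  # warmup length is an assumed parameter
        correct = np.asarray(correct, dtype=float)
        positions = np.arange(1, len(correct) + 1)
        slope, _ = np.polyfit(positions, correct, deg=1)  # accuracy slope ~ learning speed
        first_hits = np.flatnonzero(correct)
        return {
            "overall_accuracy": correct.mean(),
            "accuracy_slope": slope,
            "first_correct_position": int(first_hits[0]) + 1 if first_hits.size else None,
            "post_warmup_accuracy": correct[warmup:].mean() if len(correct) > warmup else None,
        }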

Maintenance & Community

The project is associated with ByteDance and academic collaborators, including Fudan University. Contact information for Shihan Dou and Ming Zhang is provided in the repository.

Licensing & Compatibility

  • Code License: Apache-2.0
  • Data License: CC BY 4.0
  • Compatible with commercial use under Apache-2.0.

Limitations & Caveats

Canonical answers for problems are not open-sourced due to intellectual property concerns, meaning the evaluation relies on LLM-based judging. The benchmark is described as "pioneering," suggesting it may be in early stages of adoption and refinement.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

423 stars in the last 90 days
