EvaLearn by ByteDance-Seed

Benchmark for LLM learning capability and efficiency

created 2 months ago
422 stars

Top 70.8% on sourcepulse

Project Summary

EvaLearn is a benchmark designed to evaluate the learning capability and efficiency of Large Language Models (LLMs) through sequential problem-solving. It targets researchers and developers seeking to quantify how well LLMs adapt and improve over a series of related tasks, offering a more dynamic assessment than static benchmarks.

How It Works

EvaLearn structures 648 problems into 182 sequences, each focusing on a specific task type. The core innovation is its sequential evaluation approach, where LLMs must solve problems in order, leveraging knowledge gained from prior solutions. This mimics real-world learning scenarios and allows for metrics like learning speed and post-warmup performance.
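
Conceptually, the protocol looks like the following minimal Python sketch. The ask_model and judge callables are placeholders for the model under test and the LLM judge; they are illustrative assumptions, not EvaLearn's actual API.

    # Minimal sketch of the sequential protocol; ask_model and judge are placeholders.
    def evaluate_sequence(problems, ask_model, judge):
        history, results = [], []
        for position, problem in enumerate(problems, start=1):
            # The model sees earlier problems and its own earlier answers, so later
            # items in the sequence can benefit from what it "learned" so far.
            answer = ask_model(context=history, question=problem["question"])
            correct = judge(problem, answer)  # LLM-based judging (canonical answers are not public)
            history.append({"question": problem["question"], "answer": answer})
            results.append({"position": position, "correct": correct})
        return results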

Quick Start & Requirements

  • Installation: git clone the repository, cd into it, and run pip install -r requirements.txt.
  • Prerequisites: Python 3.7+ and API keys for OpenAI and any other LLM providers being evaluated.
  • Usage: Run the evaluation via python EvaLearn/Evaluate/evaluate.py with the input, sequence, and output paths and the relevant API keys; library usage is also supported via direct import (see the sketch after this list).
  • Resources: Requires API access to LLMs for evaluation.
  • Links: Paper
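
A minimal sketch of the library-style usage mentioned above. The module path is inferred from the script location, but the evaluate entry point and its keyword arguments are assumptions for illustration; consult the repository for the actual interface.

    # Hypothetical library usage; the function name and parameters below are
    # assumptions, not EvaLearn's confirmed API.
    import os
    from EvaLearn.Evaluate.evaluate import evaluate  # module path assumed from the script location

    results = evaluate(
        input_path="data/problems.jsonl",      # problem data (path and format assumed)
        sequence_path="data/sequences.json",   # sequence definitions (assumed)
        output_path="results/run1.jsonl",      # where per-problem judgments are written
        api_key=os.environ["OPENAI_API_KEY"],  # key for the LLM judge
    )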

Highlighted Details

  • Evaluates LLMs on sequential problem-solving, measuring learning capability and efficiency.
  • Employs a sequential evaluation tool (evaluate.py) and structured datasets for problems and sequences.
  • Offers metrics including overall sequence accuracy, accuracy slope (learning speed), average position of the first correct solution, and post-warmup accuracy (see the sketch after this list).
  • Supports custom sequence selection by type or ID.
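
The sequence-level metrics above can be computed from per-position correctness roughly as in this sketch; these are plausible formulations for illustration, not necessarily EvaLearn's exact definitions.

    import numpy as np

    # Plausible formulations of the four metrics, computed over one sequence;
    # `correct` is a list of booleans, one per problem position.
    def sequence_metrics(correct, warmup=3):  # warmup length is an assumed parameter
        correct = np.asarray(correct, dtype=float)
        positions = np.arange(1, len(correct) + 1)
        slope, _ = np.polyfit(positions, correct, deg=1)  # accuracy slope ~ learning speed
        first_hits = np.flatnonzero(correct)
        return {
            "overall_accuracy": correct.mean(),
            "accuracy_slope": slope,
            "first_correct_position": int(first_hits[0]) + 1 if first_hits.size else None,
            "post_warmup_accuracy": correct[warmup:].mean() if len(correct) > warmup else None,
        }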

Maintenance & Community

The project is associated with ByteDance and academic collaborators, including Fudan University. Contact information for Shihan Dou and Ming Zhang is provided in the repository.

Licensing & Compatibility

  • Code License: Apache-2.0
  • Data License: CC BY 4.0
  • Compatible with commercial use under Apache-2.0.

Limitations & Caveats

Canonical answers for problems are not open-sourced due to intellectual property concerns, meaning the evaluation relies on LLM-based judging. The benchmark is described as "pioneering," suggesting it may be in early stages of adoption and refinement.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

423 stars in the last 90 days
