EvaLearn by ByteDance-Seed

Benchmark for LLM learning capability and efficiency

Created 3 months ago
422 stars

Top 69.8% on SourcePulse

View on GitHub
Project Summary

EvaLearn is a benchmark designed to evaluate the learning capability and efficiency of Large Language Models (LLMs) through sequential problem-solving. It targets researchers and developers seeking to quantify how well LLMs adapt and improve over a series of related tasks, offering a more dynamic assessment than static benchmarks.

How It Works

EvaLearn structures 648 problems into 182 sequences, each focusing on a specific task type. The core innovation is its sequential evaluation approach, where LLMs must solve problems in order, leveraging knowledge gained from prior solutions. This mimics real-world learning scenarios and allows for metrics like learning speed and post-warmup performance.
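
A minimal sketch of what such a sequential loop could look like, assuming a generic chat-style model call and an external judge; the function names below are illustrative placeholders, not part of EvaLearn's API:

```python
# Illustrative sequential evaluation loop: the model sees its own earlier
# attempts (and the judge's verdicts) before each new problem in a sequence.
# `call_llm` and `judge` are placeholder callables, not EvaLearn APIs.

def build_prompt(history, problem):
    # Prepend earlier problems, the model's answers, and feedback so the
    # model can adapt within the sequence.
    lines = []
    for prev_problem, prev_answer, prev_correct in history:
        lines.append(f"Problem: {prev_problem}")
        lines.append(f"Your answer: {prev_answer}")
        lines.append(f"Feedback: {'correct' if prev_correct else 'incorrect'}")
    lines.append(f"Problem: {problem}")
    return "\n".join(lines)

def evaluate_sequence(problems, call_llm, judge):
    history = []   # accumulated (problem, answer, verdict) tuples
    results = []   # per-position correctness flags
    for problem in problems:
        answer = call_llm(build_prompt(history, problem))
        correct = judge(problem, answer)   # e.g. an LLM-based judge
        history.append((problem, answer, correct))
        results.append(correct)
    return results
```

The per-position correctness flags returned by such a loop are what sequence-level metrics like accuracy slope, first-correct position, and post-warmup accuracy are computed from.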

Quick Start & Requirements

  • Installation: git clone the repository, cd into it, and run pip install -r requirements.txt.
  • Prerequisites: Python 3.7+ and API keys for OpenAI and/or other LLM providers.
  • Usage: Run evaluation via python EvaLearn/Evaluate/evaluate.py, supplying input, sequence, and output paths along with API keys (see the command sketch after this list). The evaluator can also be imported and used as a library.
  • Resources: Requires API access to LLMs for evaluation.
  • Links: Paper
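
A concrete shape of the steps above, as a sketch only: the repository URL is inferred from the project and organization names shown on this page, and the flag names are placeholders rather than the script's documented arguments.

```bash
# Install: clone the repository (URL inferred from the project page) and
# install the Python dependencies.
git clone https://github.com/ByteDance-Seed/EvaLearn.git
cd EvaLearn
pip install -r requirements.txt

# Run the sequential evaluation. The script path is as quoted above; the
# flag names are illustrative placeholders, not documented arguments --
# consult the repository README for the real ones.
python EvaLearn/Evaluate/evaluate.py \
  --input <input_path> \
  --sequence <sequence_path> \
  --output <output_path> \
  --api-key <YOUR_API_KEY>
```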

Highlighted Details

  • Evaluates LLMs on sequential problem-solving, measuring learning capability and efficiency.
  • Employs a sequential evaluation tool (evaluate.py) and structured datasets for problems and sequences.
  • Offers metrics including overall sequence accuracy, accuracy slope (learning speed), average position of first correct solution, and post-warmup accuracy (a computation sketch follows this list).
  • Supports custom sequence selection by type or ID.
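
A minimal sketch of how these metrics might be computed from per-position correctness flags for a single sequence; the names and formulas here are illustrative assumptions, not EvaLearn's exact definitions (the linked paper gives the official ones):

```python
# Illustrative metric computations over one sequence, given a list of
# booleans indicating whether the model solved each position correctly.
# Names and formulas are a sketch, not EvaLearn's exact definitions.

def sequence_metrics(correct, warmup=3):
    n = len(correct)
    acc = [int(c) for c in correct]

    overall_accuracy = sum(acc) / n

    # Accuracy slope: least-squares slope of correctness vs. position,
    # used here as a rough proxy for learning speed.
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = overall_accuracy
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, acc))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var if var else 0.0

    # Position of the first correct solution in this sequence (None if never).
    first_correct = next((i + 1 for i, c in enumerate(acc) if c), None)

    # Accuracy after an initial warm-up window of `warmup` problems.
    post = acc[warmup:]
    post_warmup_accuracy = sum(post) / len(post) if post else None

    return {
        "overall_accuracy": overall_accuracy,
        "accuracy_slope": slope,
        "first_correct_position": first_correct,
        "post_warmup_accuracy": post_warmup_accuracy,
    }

# Example: a model that starts wrong and improves over a 7-problem sequence.
print(sequence_metrics([False, False, True, True, True, True, True]))
```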

Maintenance & Community

The project is a collaboration between ByteDance and academic institutions, including Fudan University. Contact information for Shihan Dou and Ming Zhang is provided.

Licensing & Compatibility

  • Code License: Apache-2.0
  • Data License: CC BY 4.0
  • Compatible with commercial use under Apache-2.0.

Limitations & Caveats

Canonical answers for problems are not open-sourced due to intellectual property concerns, meaning the evaluation relies on LLM-based judging. The benchmark is described as "pioneering," suggesting it may be in early stages of adoption and refinement.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days
