test by hendrycks

Research paper for measuring multitask language understanding

created 4 years ago
1,463 stars

Top 28.6% on sourcepulse

View on GitHub
Project Summary

This repository provides the Massive Multitask Language Understanding (MMLU) benchmark, a comprehensive evaluation suite for assessing the knowledge and reasoning of large language models across diverse domains. It is aimed at researchers and developers building and evaluating LLMs.

How It Works

The MMLU benchmark comprises 57 tasks spanning STEM, the humanities, the social sciences, and other areas, each consisting of multiple-choice questions. Evaluation uses few-shot prompting: a model is shown a small number of worked examples before answering the test questions, which measures its ability to generalize and apply existing knowledge in new contexts.
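
To make the few-shot protocol concrete, here is a minimal sketch of how such a prompt can be assembled from MMLU's per-subject CSV files. The layout assumed here is the standard question, four choices, answer-letter format with no header row; the file paths and function names are illustrative, not the repository's exact code.

    # Sketch of MMLU-style few-shot prompt construction (illustrative, not the repo's code).
    # Assumed CSV layout per row: question, choice A, B, C, D, answer letter; no header.
    import csv

    CHOICES = ["A", "B", "C", "D"]

    def format_example(row, include_answer=True):
        question, a, b, c, d, answer = row
        lines = [question] + [f"{lbl}. {txt}" for lbl, txt in zip(CHOICES, (a, b, c, d))]
        lines.append(f"Answer: {answer}" if include_answer else "Answer:")
        return "\n".join(lines)

    def build_prompt(subject, dev_rows, test_row, k=5):
        header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
        shots = "\n\n".join(format_example(r) for r in dev_rows[:k])
        return header + shots + "\n\n" + format_example(test_row, include_answer=False)

    # Hypothetical paths; the downloaded archive contains dev/val/test CSVs per subject.
    with open("data/dev/abstract_algebra_dev.csv") as f:
        dev_rows = list(csv.reader(f))
    with open("data/test/abstract_algebra_test.csv") as f:
        test_rows = list(csv.reader(f))

    print(build_prompt("abstract algebra", dev_rows, test_rows[0]))

The model's answer is then read off as the letter it produces after the final "Answer:".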

Quick Start & Requirements

The MMLU data is distributed as a downloadable archive linked from the README, and the repository also contains evaluation code that queries the OpenAI API. No installation commands are given; in practice, usage means downloading the test data and wiring it into whichever inference framework or API serves the model under evaluation.
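
As a quick-start illustration, the sketch below scores one subject's test split by comparing predicted answer letters against the answer column. The `predict` callable and the file path are hypothetical stand-ins for whatever model or API is being evaluated.

    # Hypothetical quick-start: score one MMLU subject directly from its test CSV.
    # `predict` stands in for any model call that returns one of "A"/"B"/"C"/"D".
    import csv

    def score_subject(test_csv_path, predict):
        with open(test_csv_path) as f:
            rows = list(csv.reader(f))
        correct = sum(predict(row[:5]) == row[5] for row in rows)
        return correct / len(rows)

    # Trivial baseline that always answers "A" (expected around 25% on 4-way questions).
    acc = score_subject("data/test/abstract_algebra_test.csv", lambda fields: "A")
    print(f"abstract_algebra accuracy: {acc:.3f}")

An overall score then aggregates these per-subject accuracies across the 57 tasks.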

Highlighted Details

  • Comprehensive benchmark with 57 diverse tasks.
  • Includes a leaderboard for tracking model performance.
  • Some questions are drawn from the ETHICS dataset (used for the morality-related tasks).
  • Reports results for several prominent LLMs.

Maintenance & Community

The project is associated with ICLR 2021 and lists several authors from prominent institutions. The README encourages users to reach out or submit pull requests to add models to the leaderboard.

Licensing & Compatibility

The repository does not explicitly state a license. However, the inclusion of research papers and a leaderboard suggests it is intended for academic and research use. Commercial use compatibility is not specified.

Limitations & Caveats

The benchmark focuses on multiple-choice questions and few-shot evaluation, which may not capture all aspects of language understanding or real-world performance. The evaluation code is tied to the OpenAI API, limiting its direct applicability to models not served through that API.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 71 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Luca Antiga (CTO of Lightning AI), and 4 more.

helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

Top 0.9% on sourcepulse, 2k stars, created 3 years ago, updated 1 day ago