test by hendrycks

Research paper for measuring multitask language understanding

created 4 years ago
1,463 stars

Top 28.6% on sourcepulse

View on GitHub
Project Summary

This repository provides the Massive Multitask Language Understanding (MMLU) benchmark, a comprehensive evaluation suite for assessing the knowledge and reasoning of large language models across diverse domains. It is aimed at researchers and developers building and evaluating LLMs.

How It Works

The MMLU benchmark comprises 57 tasks spanning STEM, the humanities, the social sciences, and other areas, each consisting of multiple-choice questions. Evaluation uses few-shot prompting: a model is shown a small number of worked examples before answering the test questions, which measures its ability to generalize and apply existing knowledge in new contexts.
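
To make the few-shot protocol concrete, here is a minimal sketch of how such a prompt can be assembled from MMLU's per-subject CSV files. The layout assumed here is the standard question, four choices, answer-letter format with no header row; the file paths and function names are illustrative, not the repository's exact code.

    # Sketch of MMLU-style few-shot prompt construction (illustrative, not the repo's code).
    # Assumed CSV layout per row: question, choice A, B, C, D, answer letter; no header.
    import csv

    CHOICES = ["A", "B", "C", "D"]

    def format_example(row, include_answer=True):
        question, a, b, c, d, answer = row
        lines = [question] + [f"{lbl}. {txt}" for lbl, txt in zip(CHOICES, (a, b, c, d))]
        lines.append(f"Answer: {answer}" if include_answer else "Answer:")
        return "\n".join(lines)

    def build_prompt(subject, dev_rows, test_row, k=5):
        header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
        shots = "\n\n".join(format_example(r) for r in dev_rows[:k])
        return header + shots + "\n\n" + format_example(test_row, include_answer=False)

    # Hypothetical paths; the downloaded archive contains dev/val/test CSVs per subject.
    with open("data/dev/abstract_algebra_dev.csv") as f:
        dev_rows = list(csv.reader(f))
    with open("data/test/abstract_algebra_test.csv") as f:
        test_rows = list(csv.reader(f))

    print(build_prompt("abstract algebra", dev_rows, test_rows[0]))

The model's answer is then read off as the letter it produces after the final "Answer:".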

Quick Start & Requirements

The MMLU data is distributed as a downloadable archive linked from the README, and the repository also contains evaluation code that queries the OpenAI API. No installation commands are given; in practice, usage means downloading the test data and wiring it into whichever inference framework or API serves the model under evaluation.
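
As a quick-start illustration, the sketch below scores one subject's test split by comparing predicted answer letters against the answer column. The `predict` callable and the file path are hypothetical stand-ins for whatever model or API is being evaluated.

    # Hypothetical quick-start: score one MMLU subject directly from its test CSV.
    # `predict` stands in for any model call that returns one of "A"/"B"/"C"/"D".
    import csv

    def score_subject(test_csv_path, predict):
        with open(test_csv_path) as f:
            rows = list(csv.reader(f))
        correct = sum(predict(row[:5]) == row[5] for row in rows)
        return correct / len(rows)

    # Trivial baseline that always answers "A" (expected around 25% on 4-way questions).
    acc = score_subject("data/test/abstract_algebra_test.csv", lambda fields: "A")
    print(f"abstract_algebra accuracy: {acc:.3f}")

An overall score then aggregates these per-subject accuracies across the 57 tasks.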

Highlighted Details

  • Comprehensive benchmark with 57 diverse tasks.
  • Includes a leaderboard for tracking model performance.
  • Some questions are drawn from the ETHICS dataset (used for the morality-related tasks).
  • Reports results for several prominent LLMs.

Maintenance & Community

The project is associated with ICLR 2021 and lists several authors from prominent institutions. The README encourages users to reach out or submit pull requests to add models to the leaderboard.

Licensing & Compatibility

The repository does not explicitly state a license. However, the inclusion of research papers and a leaderboard suggests it is intended for academic and research use. Commercial use compatibility is not specified.

Limitations & Caveats

The benchmark focuses on multiple-choice questions and few-shot evaluation, which may not capture all aspects of language understanding or real-world performance. The evaluation code is tied to the OpenAI API, limiting its direct applicability to models not served through that API.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 71 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Luca Antiga (CTO of Lightning AI), and 4 more.

helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

Top 0.9% on sourcepulse, 2k stars, created 3 years ago, updated 1 day ago