UltraEval by OpenBMB

An open-source framework for evaluating foundation models

Created 2 years ago
258 stars

Top 98.1% on SourcePulse

Project Summary

UltraEval is an open-source framework for evaluating foundation models, offering a lightweight, easy-to-use, and scalable system for assessing mainstream LLMs. It benefits researchers and engineers by providing a standardized, transparent, and flexible evaluation process.

How It Works

The framework features a lightweight design with minimal dependencies for effortless deployment and scalability. It supports a unified prompt template with extensive, customizable evaluation metrics. For efficient assessment, UltraEval integrates multiple model deployment strategies, including torch and vLLM, enabling swift, multi-instance evaluation.
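To make the idea of a unified prompt template with pluggable metrics more concrete, here is a minimal sketch. It is not UltraEval's actual API: `PromptTemplate`, `exact_match`, `evaluate`, and the sample field names are hypothetical, and the model is replaced by a stub function so the example runs without any deployment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical illustrations, not UltraEval's interface.


@dataclass
class PromptTemplate:
    """One template string shared by every sample of a task."""
    template: str

    def render(self, sample: Dict[str, str]) -> str:
        # Fill the placeholders from the sample's fields; extra fields are ignored.
        return self.template.format(**sample)


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized strings are equal, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(
    samples: List[Dict[str, str]],
    template: PromptTemplate,
    model: Callable[[str], str],
    metrics: Dict[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    """Average every metric over all samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    totals = {name: 0.0 for name in metrics}
    for sample in samples:
        prediction = model(template.render(sample))
        for name, metric in metrics.items():
            totals[name] += metric(prediction, sample["answer"])
    return {name: total / len(samples) for name, total in totals.items()}


def toy_model(prompt: str) -> str:
    # Stand-in for a deployed model: it always answers "4".
    return "4"


samples = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "3 + 3 = ?", "answer": "6"},
]
template = PromptTemplate("Question: {question}\nAnswer:")
scores = evaluate(samples, template, toy_model, {"exact_match": exact_match})
print(scores)  # {'exact_match': 0.5}
```

Keeping the template and the metrics as separate arguments means a new task only needs a new template string, and a new metric only needs a function with the same two-argument shape.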

Quick Start & Requirements

Installation:

  • Clone and install: git clone https://github.com/OpenBMB/UltraEval.git, cd UltraEval, then pip install . (the trailing dot installs from the current directory).
  • Prepare data: download the datasets with wget "https://cloud.tsinghua.edu.cn/f/11d562a53e40411fb385/?dl=1", unzip the archive, preprocess it, and generate config files with python configs/make_config.py.
  • Evaluate: deploy the model (e.g., python URLs/vllm_url.py), then run python main.py.

Prerequisites: Python, wget, unzip; a GPU with CUDA is recommended for model deployment.

Resources: paper, website, quick start, tutorials, Colab notebook.
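The order of the quick-start commands can be laid out as a dry-run script. This is only a sketch: the URLs and script paths are the ones quoted in this summary, while the working directories, the helper names `quick_start_steps` and `render`, and the absence of command-line arguments are assumptions. The unzip and preprocessing commands are not given here, so they are left out rather than guessed; the repository's quick start and Tutorials.md are the authoritative source.

```python
import shlex

# URLs and script paths are copied from the summary; the step layout and
# function names are hypothetical illustrations.
REPO_URL = "https://github.com/OpenBMB/UltraEval.git"
DATA_URL = "https://cloud.tsinghua.edu.cn/f/11d562a53e40411fb385/?dl=1"


def quick_start_steps():
    """Return (working_directory, argv) pairs in the order they would run."""
    return [
        (".", ["git", "clone", REPO_URL]),
        ("UltraEval", ["pip", "install", "."]),
        # Assumption: the dataset archive is downloaded inside the repo folder.
        ("UltraEval", ["wget", DATA_URL]),
        # Unzipping and preprocessing happen here; the exact commands are not
        # listed in the summary, so they are intentionally omitted.
        ("UltraEval", ["python", "configs/make_config.py"]),
        ("UltraEval", ["python", "URLs/vllm_url.py"]),  # model deployment
        ("UltraEval", ["python", "main.py"]),  # run the evaluation
    ]


def render(steps):
    """Format each step as a shell line without executing anything."""
    return [f"(cd {cwd} && {shlex.join(argv)})" for cwd, argv in steps]


for line in render(quick_start_steps()):
    print(line)
```

Printing instead of executing keeps the sketch safe to run anywhere; any required script arguments would need to be added to each argv list before the commands are actually run.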

Highlighted Details

  • Supports 59 diverse evaluation datasets across knowledge, math, code, reasoning, and language tasks.
  • Features a flexible system with a unified prompt template and extensive, customizable metrics.
  • Enables efficient inference deployment via torch and vLLM for rapid, multi-instance evaluation.
  • Maintains a transparent, traceable, and reproducible open-source leaderboard.
  • Utilizes official evaluation sets for standardized, comparable results.
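The multi-instance evaluation mentioned in the list above can be pictured as spreading prompts over several model instances and collecting the answers in order. The sketch below assumes a simple round-robin assignment with a thread pool; `make_instance` and `run_multi_instance` are hypothetical names, the instances are stub functions rather than real torch or vLLM deployments, and UltraEval's own scheduling may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def make_instance(name):
    """Build a stub 'model instance' that tags its output with its own name."""
    def generate(prompt):
        # Stand-in for a request to one deployed model instance.
        return f"{name}:{prompt.upper()}"
    return generate


def run_multi_instance(prompts, instances):
    """Assign prompts to instances round-robin and run them concurrently."""
    if not instances:
        raise ValueError("at least one instance is required")
    assignments = list(zip(prompts, cycle(instances)))
    with ThreadPoolExecutor(max_workers=len(instances)) as pool:
        # pool.map returns results in the same order as the input prompts.
        return list(pool.map(lambda pair: pair[1](pair[0]), assignments))


instances = [make_instance("w0"), make_instance("w1")]
outputs = run_multi_instance(["a", "b", "c"], instances)
print(outputs)  # ['w0:A', 'w1:B', 'w0:C']
```

Because results come back in input order, each output can be matched to its sample for scoring regardless of which instance produced it.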

Maintenance & Community

The UltraEval paper was accepted to the ACL 2024 System Demonstration Track. The project was open-sourced in late 2023, and MiniCPM uses it for evaluations. Community discussion and feature requests go through GitHub Issues. Acknowledgements: HuggingFace, vLLM, Harness, OpenCompass.

Licensing & Compatibility

Released under the Apache-2.0 license, which is permissive for commercial use and integration within closed-source projects.

Limitations & Caveats

The README does not explicitly detail limitations like alpha status or known bugs. Advanced usage or specific configurations may require consulting Tutorials.md.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

  • openbench by groq: Provider-agnostic LLM evaluation infrastructure. 782 stars (0.6%). Created 10 months ago, updated 1 month ago. Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Michael Chiang (Cofounder of Ollama), and 7 more.
  • evaluate by huggingface: ML model evaluation library for standardized performance reporting. 2k stars (0.1%). Created 4 years ago, updated 2 weeks ago. Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.
  • lighteval by huggingface: LLM evaluation toolkit for multiple backends. 2k stars (0.2%). Created 2 years ago, updated 3 days ago. Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 8 more.