UltraEval by OpenBMB

An open-source framework for evaluating foundation models

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

Project Summary

Summary

UltraEval is an open-source framework for evaluating foundation models, offering a lightweight, easy-to-use, and scalable system for assessing mainstream LLMs. It benefits researchers and engineers by providing a standardized, transparent, and flexible evaluation process.

How It Works

The framework features a lightweight design with minimal dependencies for effortless deployment and scalability. It supports a unified prompt template with extensive, customizable evaluation metrics. For efficient assessment, UltraEval integrates multiple model deployment strategies, including torch and vLLM, enabling swift, multi-instance evaluation.

Quick Start & Requirements

  • Installation: git clone https://github.com/OpenBMB/UltraEval.git, then cd UltraEval and pip install .
  • Data preparation: download the datasets (wget "https://cloud.tsinghua.edu.cn/f/11d562a53e40411fb385/?dl=1"), unzip and preprocess them, then generate config files (python configs/make_config.py).
  • Evaluation: deploy the model under test (e.g., python URLs/vllm_url.py), then run python main.py.
  • Prerequisites: Python, wget, unzip; a GPU with CUDA is recommended for model deployment.
  • Resources: paper, website, quick start, tutorials, Colab notebook.
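
The steps above, strung together as a minimal shell sketch. The commands are the ones quoted in this summary; the archive name RawData.zip, the preprocessing step, and the arguments to URLs/vllm_url.py are assumptions and may differ from the current repository, so check the quick start and Tutorials.md for exact usage.

    # Install UltraEval from source
    git clone https://github.com/OpenBMB/UltraEval.git
    cd UltraEval
    pip install .

    # Fetch and unpack the benchmark datasets (archive name assumed here)
    wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/11d562a53e40411fb385/?dl=1"
    unzip RawData.zip

    # Preprocess the datasets (see the repo's tutorials for the exact script),
    # then generate the evaluation config files
    python configs/make_config.py

    # Deploy the model to evaluate (vLLM backend; model path and port set per the tutorials),
    # then launch the evaluation
    python URLs/vllm_url.py
    python main.py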

Highlighted Details

  • Supports 59 diverse evaluation datasets across knowledge, math, code, reasoning, and language tasks.
  • Features a flexible system with a unified prompt template and extensive, customizable metrics.
  • Enables efficient inference deployment via torch and vLLM for rapid, multi-instance evaluation (a deployment sketch follows this list).
  • Maintains a transparent, traceable, and reproducible open-source leaderboard.
  • Utilizes official evaluation sets for standardized, comparable results.
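
As a rough illustration of the multi-instance idea, the snippet below starts two vLLM OpenAI-compatible servers on separate GPUs and ports. This is not UltraEval's own URLs/vllm_url.py wrapper; the model path, GPU indices, and ports are placeholders, and the framework's tutorials describe how evaluated endpoints are actually registered with main.py.

    # Hypothetical sketch: two independent vLLM endpoints for parallel evaluation
    CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
        --model /path/to/model --port 8000 &
    CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
        --model /path/to/model --port 8001 &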

Maintenance & Community

The accompanying paper was accepted to the ACL 2024 System Demonstration Track. MiniCPM uses UltraEval for its evaluations. The project was open-sourced in late 2023. Community engagement happens through GitHub Issues for discussions and feature requests. Acknowledgements: HuggingFace, vLLM, Harness, OpenCompass.

Licensing & Compatibility

Released under the Apache-2.0 license, which is permissive for commercial use and integration within closed-source projects.

Limitations & Caveats

The README does not explicitly detail limitations like alpha status or known bugs. Advanced usage or specific configurations may require consulting Tutorials.md.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

openbench by groq
  Provider-agnostic LLM evaluation infrastructure
  Top 2.6% on SourcePulse · 592 stars · Created 2 months ago · Updated 2 days ago
  Starred by Maxime Labonne (Head of Post-Training at Liquid AI), Lewis Tunstall (Research Engineer at Hugging Face), and 5 more.

evaluate by huggingface
  ML model evaluation library for standardized performance reporting
  Top 0.1% on SourcePulse · 2k stars · Created 3 years ago · Updated 2 weeks ago
  Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.