Evaluation harness for LLM tool use, step by step
T-Eval is an evaluation harness designed to assess the tool utilization capabilities of Large Language Models (LLMs) by decomposing the process into distinct sub-processes. It targets researchers and developers aiming for a granular understanding of LLM performance in tool-augmented applications, offering a more detailed analysis than holistic evaluations.
How It Works
T-Eval breaks down tool utilization into six key sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. This step-by-step approach allows for a fine-grained analysis of LLM competencies, providing insights into both overall and isolated performance in tool interaction. The evaluation framework supports both API-based and HuggingFace models.
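As a rough illustration of this decomposition (the field names, score scale, and simple mean aggregation below are assumptions, not T-Eval's actual result schema), per-sub-process scores might be combined into a single tool-use score like this:

```python
# Illustrative sketch only: the keys, the 0-100 scale, and the plain mean
# aggregation are assumptions, not T-Eval's real output format.
from statistics import mean

# Hypothetical per-sub-process scores for one model, one entry per
# sub-process evaluated by the harness.
subprocess_scores = {
    "instruction_following": 84.2,
    "planning": 77.5,
    "reasoning": 71.9,
    "retrieval": 80.3,
    "understanding": 75.1,
    "review": 68.4,
}

overall = mean(subprocess_scores.values())
print(f"overall tool-use score: {overall:.1f}")

# Sort ascending so the weakest sub-process is listed first.
for name, score in sorted(subprocess_scores.items(), key=lambda kv: kv[1]):
    print(f"  {name:<22} {score:.1f}")
```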
Quick Start & Requirements
- Install the dependencies (`pip install -r requirements.txt`) and install the `lagent` library (`cd lagent && pip install -e .`).
- For API-based models, set the `OPENAI_API_KEY` environment variable; for HuggingFace models, configure `meta_template.json`.
- Run the shell scripts (`test_all_en.sh`, `test_all_zh.sh`) or the Python script (`test.py`) to launch an evaluation (see the sketch after this list).
- Requires the `datasets` library and, for local model inference, potentially CUDA.
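Putting those steps together, a typical run might look like the sketch below (the `export` line applies only to API-based models; adjust for HuggingFace models as noted above):

```bash
# Install T-Eval's dependencies and the bundled lagent library
pip install -r requirements.txt
cd lagent && pip install -e . && cd ..

# API-based models read the key from the environment
export OPENAI_API_KEY=sk-...

# Run the English (or Chinese) evaluation suite
bash test_all_en.sh
# bash test_all_zh.sh
```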
Highlighted Details
Uses `lagent` for optimized HuggingFace model inference.
Maintenance & Community
The project is built upon `lagent` and `OpenCompass`. Submission of results is accepted via email for leaderboard updates.
Licensing & Compatibility
Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Some models may not support batch inference. The project is actively under development with several items on the TODO list, including support for batch inference and further integration with OpenCompass.