Evaluation harness for LLM tool use, step by step
T-Eval is an evaluation harness designed to assess the tool utilization capabilities of Large Language Models (LLMs) by decomposing the process into distinct sub-processes. It targets researchers and developers aiming for a granular understanding of LLM performance in tool-augmented applications, offering a more detailed analysis than holistic evaluations.
How It Works
T-Eval breaks down tool utilization into six key sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. This step-by-step approach allows for a fine-grained analysis of LLM competencies, providing insights into both overall and isolated performance in tool interaction. The evaluation framework supports both API-based and HuggingFace models.
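As a rough illustration of this decomposition (the field names, score scale, and simple mean aggregation below are assumptions, not T-Eval's actual result schema), per-sub-process scores might be combined into a single tool-use score like this:

```python
# Illustrative sketch only: the keys, the 0-100 scale, and the plain mean
# aggregation are assumptions, not T-Eval's real output format.
from statistics import mean

# Hypothetical per-sub-process scores for one model, one entry per
# sub-process evaluated by the harness.
subprocess_scores = {
    "instruction_following": 84.2,
    "planning": 77.5,
    "reasoning": 71.9,
    "retrieval": 80.3,
    "understanding": 75.1,
    "review": 68.4,
}

overall = mean(subprocess_scores.values())
print(f"overall tool-use score: {overall:.1f}")

# Sort ascending so the weakest sub-process is listed first.
for name, score in sorted(subprocess_scores.items(), key=lambda kv: kv[1]):
    print(f"  {name:<22} {score:.1f}")
```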
Quick Start & Requirements
- Install the dependencies (`pip install -r requirements.txt`) and install the `lagent` library (`cd lagent && pip install -e .`).
- For API-based models, set the `OPENAI_API_KEY` environment variable; for HuggingFace models, configure `meta_template.json`.
- Run the shell scripts (`test_all_en.sh`, `test_all_zh.sh`) or the Python script (`test.py`) to launch an evaluation (see the sketch after this list).
- Requires the `datasets` library and, for local model inference, potentially CUDA.
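Putting those steps together, a typical run might look like the sketch below (the `export` line applies only to API-based models; adjust for HuggingFace models as noted above):

```bash
# Install T-Eval's dependencies and the bundled lagent library
pip install -r requirements.txt
cd lagent && pip install -e . && cd ..

# API-based models read the key from the environment
export OPENAI_API_KEY=sk-...

# Run the English (or Chinese) evaluation suite
bash test_all_en.sh
# bash test_all_zh.sh
```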
Highlighted Details
Uses `lagent` for optimized HuggingFace model inference.
Maintenance & Community
The project is built upon `lagent` and `OpenCompass`. Submission of results is accepted via email for leaderboard updates.
Licensing & Compatibility
Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Some models may not support batch inference. The project is actively under development with several items on the TODO list, including support for batch inference and further integration with OpenCompass.