T-Eval by open-compass

Evaluation harness for LLM tool use, step-by-step

created 1 year ago
283 stars


Project Summary

T-Eval is an evaluation harness designed to assess the tool utilization capabilities of Large Language Models (LLMs) by decomposing the process into distinct sub-processes. It targets researchers and developers aiming for a granular understanding of LLM performance in tool-augmented applications, offering a more detailed analysis than holistic evaluations.

How It Works

T-Eval breaks down tool utilization into six key sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. This step-by-step approach allows for a fine-grained analysis of LLM competencies, providing insights into both overall and isolated performance in tool interaction. The evaluation framework supports both API-based and HuggingFace models.
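The relationship between isolated and overall performance can be illustrated with a minimal sketch. It assumes each sub-process yields a single score between 0 and 1 and that the overall figure is an unweighted mean; both are assumptions made for illustration, and the key names and the `summarize` and `weakest` helpers are hypothetical rather than part of T-Eval.

```python
from statistics import mean

# The six sub-processes named above. These identifiers are illustrative labels,
# not necessarily the keys the harness itself uses.
SUBPROCESSES = (
    "instruction_following",
    "planning",
    "reasoning",
    "retrieval",
    "understanding",
    "review",
)


def summarize(scores: dict[str, float]) -> dict[str, float]:
    """Return the per-sub-process scores plus an 'overall' entry.

    Assumption: 'overall' is the unweighted mean of the six scores.
    """
    missing = [name for name in SUBPROCESSES if name not in scores]
    if missing:
        raise ValueError(f"missing sub-process scores: {missing}")
    report = {name: scores[name] for name in SUBPROCESSES}
    report["overall"] = mean(report.values())
    return report


def weakest(scores: dict[str, float]) -> str:
    """Name the sub-process with the lowest score."""
    return min(SUBPROCESSES, key=lambda name: scores[name])
```

A per-dimension report like this shows why the decomposition is useful: two models with the same overall score can have different weakest sub-processes.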

Quick Start & Requirements

  • Installation: Clone the repository, install requirements (pip install -r requirements.txt), and install the lagent library (cd lagent && pip install -e .).
  • Data: Download test data from Google Drive or HuggingFace Datasets.
  • API Models: Set the OPENAI_API_KEY environment variable.
  • HuggingFace Models: Download models locally and configure meta_template.json.
  • Execution: Run the provided shell scripts (test_all_en.sh, test_all_zh.sh) or the Python script (test.py).
  • Resources: Requires Python and the datasets library; CUDA may be needed for local model inference.
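Before launching an evaluation it can help to confirm that the prerequisites listed above are in place. The following is a small sketch of such a check using only the standard library. The `preflight` function, its parameters, and the "api" and "hf" labels are hypothetical and not part of T-Eval; only the OPENAI_API_KEY variable name comes from the steps above.

```python
import os
from pathlib import Path


def preflight(model_type, data_path, model_path=None, env=None):
    """Return a list of problems found; an empty list means the checks passed.

    model_type: "api" or "hf" (hypothetical labels for the two model kinds).
    data_path:  location of a downloaded test data file.
    model_path: local directory of a HuggingFace model (only used for "hf").
    env:        mapping of environment variables; defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []

    # The test data has to be downloaded before any evaluation can run.
    if not Path(data_path).is_file():
        problems.append(f"test data not found: {data_path}")

    if model_type == "api":
        # API-based models need the key exported in the environment.
        if not env.get("OPENAI_API_KEY"):
            problems.append("OPENAI_API_KEY is not set")
    elif model_type == "hf":
        # HuggingFace models are expected to be available locally.
        if not model_path or not Path(model_path).is_dir():
            problems.append(f"local model directory not found: {model_path}")
    else:
        problems.append(f"unknown model type: {model_type}")

    return problems
```

Calling `preflight("api", "path/to/test_data.json")` (with a placeholder path) before running the evaluation scripts returns a list of human-readable problems instead of failing partway through a run.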

Highlighted Details

  • Supports both English and Chinese evaluation datasets.
  • Offers a leaderboard for comparing LLM tool utilization performance.
  • Integrates with lagent for optimized HuggingFace model inference.
  • Provides submission guidelines for updating the official leaderboard.

Maintenance & Community

The project is built on lagent and OpenCompass. Results for the official leaderboard are submitted by email.

Licensing & Compatibility

Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Some models may not support batch inference. The project's TODO list still includes batch inference support and further integration with OpenCompass.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 90 days
