DevOps-Eval: benchmark for LLMs in the DevOps/AIOps domain
This repository provides DevOps-Eval, a comprehensive benchmark suite for evaluating Large Language Models (LLMs) in the DevOps and AIOps domains. It offers a structured way for developers to track model progress and identify strengths and weaknesses, featuring a large dataset of multiple-choice questions and practical scenarios.
How It Works
DevOps-Eval comprises three main categories: DevOps (general), AIOps (log parsing, time series analysis, root cause analysis), and ToolLearning (function calling across various tools). The benchmark includes both zero-shot and few-shot evaluation settings, with specific data splits for development (few-shot examples) and testing. The evaluation framework allows users to integrate and test their own Hugging Face-formatted models by defining custom loader and context builder functions.
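For Hugging Face-formatted models, integration amounts to supplying a loader and a context builder. The sketch below shows the general shape of those two pieces; the function names and signatures here are illustrative assumptions, not the framework's actual hooks.

```python
# Illustrative sketch only: the real DevOps-Eval loader/context-builder
# hooks may use different names and signatures.
from typing import Optional
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    """Load a Hugging Face-formatted causal LM and its tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model.eval()
    return model, tokenizer

def build_context(question: str, choices: dict, few_shot: Optional[list] = None) -> str:
    """Format a multiple-choice item, optionally prefixed with few-shot examples."""
    def render(item: dict, with_answer: bool) -> str:
        lines = [item["question"]]
        lines += [f"{label}. {text}" for label, text in item["choices"].items()]
        lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
        return "\n".join(lines)

    blocks = [render(shot, with_answer=True) for shot in (few_shot or [])]
    blocks.append(render({"question": question, "choices": choices}, with_answer=False))
    return "\n\n".join(blocks)
```

In the zero-shot setting `few_shot` is left empty; in the few-shot setting it is filled with examples from the development split, and the model is scored on which choice label it produces after the final "Answer:".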
Quick Start & Requirements
Dataset: download devopseval-exam.zip, or load it with the Hugging Face datasets library via load_dataset("DevOps-Eval/devopseval-exam").
Evaluation: run python src/run_eval.py with the model path, configuration, and dataset details specified.
Requirements: datasets and pandas; model-specific dependencies will vary.
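For example, loading the exam with the datasets library looks roughly like this; the subset and split names shown are assumptions, so check the Hugging Face dataset card for the exact configurations.

```python
from datasets import load_dataset

# "UnitTesting" is a placeholder subset name; pick one listed on the
# DevOps-Eval/devopseval-exam dataset card. Split names may also differ.
exam = load_dataset("DevOps-Eval/devopseval-exam", name="UnitTesting")

print(exam)             # expected: a DatasetDict with dev (few-shot) and test splits
print(exam["test"][0])  # a single multiple-choice question with its options
```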
Highlighted Details
Maintenance & Community
The project is actively updated, with recent additions including ToolLearning samples and AIOps leaderboards. Links to Hugging Face and Chinese/English tutorials are provided.
Licensing & Compatibility
Licensed under the Apache License (Version 2.0). This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The project is still under development, with planned additions including more samples, harder difficulty levels, and an English version of the samples. The "Coming Soon" note for the citation suggests the primary research paper has not yet been published.