DevOps-Eval: benchmark for LLMs in the DevOps/AIOps domain
This repository provides DevOps-Eval, a comprehensive benchmark suite for evaluating Large Language Models (LLMs) in the DevOps and AIOps domains. It offers a structured way for developers to track model progress and identify strengths and weaknesses, featuring a large dataset of multiple-choice questions and practical scenarios.
How It Works
DevOps-Eval comprises three main categories: DevOps (general), AIOps (log parsing, time series analysis, root cause analysis), and ToolLearning (function calling across various tools). The benchmark includes both zero-shot and few-shot evaluation settings, with specific data splits for development (few-shot examples) and testing. The evaluation framework allows users to integrate and test their own Hugging Face-formatted models by defining custom loader and context builder functions.
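For Hugging Face-formatted models, integration amounts to supplying a loader and a context builder. The sketch below shows the general shape of those two pieces; the function names and signatures here are illustrative assumptions, not the framework's actual hooks.

```python
# Illustrative sketch only: the real DevOps-Eval loader/context-builder
# hooks may use different names and signatures.
from typing import Optional
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    """Load a Hugging Face-formatted causal LM and its tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model.eval()
    return model, tokenizer

def build_context(question: str, choices: dict, few_shot: Optional[list] = None) -> str:
    """Format a multiple-choice item, optionally prefixed with few-shot examples."""
    def render(item: dict, with_answer: bool) -> str:
        lines = [item["question"]]
        lines += [f"{label}. {text}" for label, text in item["choices"].items()]
        lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
        return "\n".join(lines)

    blocks = [render(shot, with_answer=True) for shot in (few_shot or [])]
    blocks.append(render({"question": question, "choices": choices}, with_answer=False))
    return "\n\n".join(blocks)
```

In the zero-shot setting `few_shot` is left empty; in the few-shot setting it is filled with examples from the development split, and the model is scored on which choice label it produces after the final "Answer:".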
Quick Start & Requirements
Dataset: download devopseval-exam.zip, or load it with the Hugging Face datasets library via load_dataset("DevOps-Eval/devopseval-exam").
Evaluation: run python src/run_eval.py with the model path, configuration, and dataset details specified.
Requirements: datasets and pandas; model-specific dependencies will vary.
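For example, loading the exam with the datasets library looks roughly like this; the subset and split names shown are assumptions, so check the Hugging Face dataset card for the exact configurations.

```python
from datasets import load_dataset

# "UnitTesting" is a placeholder subset name; pick one listed on the
# DevOps-Eval/devopseval-exam dataset card. Split names may also differ.
exam = load_dataset("DevOps-Eval/devopseval-exam", name="UnitTesting")

print(exam)             # expected: a DatasetDict with dev (few-shot) and test splits
print(exam["test"][0])  # a single multiple-choice question with its options
```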
Highlighted Details
Maintenance & Community
The project is actively updated, with recent additions including ToolLearning samples and AIOps leaderboards. Links to Hugging Face and Chinese/English tutorials are provided.
Licensing & Compatibility
Licensed under the Apache License (Version 2.0). This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The project is still under development, with planned additions including more samples, harder difficulty levels, and an English version of the samples. The "Coming Soon" note for the citation suggests the primary research paper has not yet been published.