codefuse-devops-eval by codefuse-ai

DevOps-Eval: benchmark for LLMs in the DevOps/AIOps domain

Created 2 years ago
641 stars

Top 51.8% on SourcePulse

Project Summary

This repository provides DevOps-Eval, a comprehensive benchmark suite for evaluating Large Language Models (LLMs) in the DevOps and AIOps domains. It offers a structured way for developers to track model progress and identify strengths and weaknesses, featuring a large dataset of multiple-choice questions and practical scenarios.

How It Works

DevOps-Eval comprises three main categories: DevOps (general), AIOps (log parsing, time series analysis, root cause analysis), and ToolLearning (function calling across various tools). The benchmark includes both zero-shot and few-shot evaluation settings, with specific data splits for development (few-shot examples) and testing. The evaluation framework allows users to integrate and test their own Hugging Face-formatted models by defining custom loader and context builder functions.
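
The exact integration hooks live in the repository's source; as a rough illustration only, a custom integration might look like the sketch below. The function names (load_model, build_context) and the prompt layout are assumptions for this sketch, not the project's actual API.

```python
# Rough illustration of integrating a Hugging Face-formatted model.
# The hook names (load_model, build_context) and the prompt layout are
# assumptions for this sketch, not the repository's actual interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(model_path: str):
    """Load a causal LM and its tokenizer from a local path or Hub ID."""
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model.eval()
    return model, tokenizer

def build_context(question: str, choices: dict, few_shot=None) -> str:
    """Assemble a multiple-choice prompt, optionally prepending few-shot examples."""
    parts = []
    if few_shot:
        for ex in few_shot:
            parts.append(f"{ex['question']}\nAnswer: {ex['answer']}\n")
    parts.append(question)
    parts.extend(f"{label}. {text}" for label, text in choices.items())
    parts.append("Answer:")
    return "\n".join(parts)
```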

Quick Start & Requirements

  • Data Download: Download devopseval-exam.zip or load via the Hugging Face datasets library (load_dataset("DevOps-Eval/devopseval-exam")); see the sketch after this list.
  • Evaluation: Run python src/run_eval.py with specified model paths, configurations, and dataset details.
  • Prerequisites: Python, Hugging Face datasets, pandas. Model-specific dependencies will vary.
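
A minimal sketch of the Hugging Face loading path mentioned above. The subset name "UnitTesting" is an assumption for illustration; consult the dataset card for the actual category names.

```python
# Load the benchmark from the Hugging Face Hub. The subset name
# "UnitTesting" is an illustrative assumption -- consult the dataset card
# for the actual category names.
from datasets import load_dataset

dataset = load_dataset("DevOps-Eval/devopseval-exam", name="UnitTesting")

# Per the benchmark's setup, the dev split holds the few-shot examples and
# the test split is used for scoring.
print(dataset)
print(dataset["test"][0])  # inspect one multiple-choice question
```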

Highlighted Details

  • Contains 7486 multiple-choice questions across 8 general DevOps categories.
  • Includes 2840 AIOps samples covering log parsing, time series anomaly detection, classification, forecasting, and root cause analysis.
  • Features 1509 ToolLearning samples across 59 fields and 239 tool scenes, compatible with OpenAI's Function Calling format (a format sketch follows this list).
  • Provides a public leaderboard for comparing model performance.
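
For context, OpenAI's Function Calling format describes each tool as a JSON schema and model answers as function calls with JSON arguments. The tool and parameters below are hypothetical examples, not entries from the ToolLearning set.

```python
# Shape of a tool definition in OpenAI's Function Calling format.
# The tool ("restart_service") and its parameters are hypothetical
# examples, not samples taken from the ToolLearning dataset.
restart_service_tool = {
    "name": "restart_service",
    "description": "Restart a service on a given host.",
    "parameters": {
        "type": "object",
        "properties": {
            "host": {"type": "string", "description": "Target hostname"},
            "service": {"type": "string", "description": "Name of the service to restart"},
        },
        "required": ["host", "service"],
    },
}

# An expected model answer in this format is a function call with JSON arguments:
expected_call = {
    "name": "restart_service",
    "arguments": '{"host": "web-01", "service": "nginx"}',
}
```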

Maintenance & Community

The changelog lists additions such as ToolLearning samples and AIOps leaderboards, though commit activity has since slowed (see Health Check below). Links to Hugging Face and Chinese/English tutorials are provided.

Licensing & Compatibility

Licensed under the Apache License (Version 2.0). This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is still under development, with planned additions including more samples, harder difficulty levels, and an English version of the samples. The "Coming Soon" note for citation suggests the primary research paper is not yet published.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 30 days

Explore Similar Projects

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

2.6% · 2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago · Updated 1 day ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 21 hours ago