MicroRCA-Agent by tangpan360

AI agent for microservice root cause analysis using multi-modal data

Created 10 months ago

251 stars

Top 99.8% on SourcePulse

Project Summary

MicroRCA-Agent performs root cause analysis for microservice systems by combining logs, traces, and metrics with Large Language Model (LLM) agents. It targets engineers and researchers who need to locate and diagnose faults in microservice architectures quickly. Its output is structured and backed by a reasoning trace, closing the loop from fault observation to root cause identification. The project ranked in the Top 5 of the 2025 CCF International AIOps Challenge.

How It Works

The system employs a modular architecture comprising five core components: data preprocessing, log fault extraction, trace fault detection, metric fault summarization, and multi-modal root cause analysis. Data interaction between these loosely coupled modules is managed through function encapsulation. Key techniques include the Drain3 algorithm for efficient log templating and data volume reduction, IsolationForest for detecting anomalies in trace durations, and LLM-based summarization for analyzing both application performance monitoring (APM) and infrastructure metrics. This multi-modal fusion, combined with LLM agents for reasoning, enables a comprehensive analysis across logs, traces, and metrics.
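The trace step described above can be illustrated with a minimal sketch. It assumes scikit-learn's IsolationForest applied to a single feature (span duration in milliseconds). The function name flag_slow_spans, the contamination value, and the synthetic data are illustrative assumptions, not the project's actual code or settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def flag_slow_spans(durations_ms, contamination=0.05, seed=0):
    """Return indices of spans whose duration the forest labels as anomalous.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    # IsolationForest expects a 2-D array: one row per span, one feature column.
    X = np.asarray(durations_ms, dtype=float).reshape(-1, 1)
    model = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=seed
    )
    labels = model.fit_predict(X)  # -1 marks an outlier, 1 an inlier
    return np.flatnonzero(labels == -1).tolist()


# Synthetic data (assumed for illustration): 200 ordinary spans near 100 ms,
# followed by three injected slow spans at indices 200, 201, and 202.
rng = np.random.default_rng(42)
durations = rng.normal(loc=100.0, scale=10.0, size=200).tolist()
durations.extend([4800.0, 5200.0, 6000.0])

anomalous = flag_slow_spans(durations)
print(f"{len(anomalous)} of {len(durations)} spans flagged")
```

The contamination parameter fixes the share of spans labeled as outliers, so a few ordinary spans at the edge of the distribution are flagged along with the injected slow ones. In practice this value would need tuning against real trace data.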

Quick Start & Requirements

Installation: Requires Git LFS for managing large files. Python dependencies are listed in src/requirements.txt and should be installed, preferably within a Conda environment (Python 3.10 recommended).
Prerequisites:
- Git LFS installation.
- Python 3.10+ environment.
- DeepSeek LLM API keys (KEJIYUN_API_KEY, KEJIYUN_API_BASE) configured in src/.env.
- Model configuration in src/agent/llm_config.py (default: deepseek-chat).
Execution: Run via bash run.sh after completing setup and configuration.
Resources: A high-performance machine is recommended because the pipeline processes large volumes of data. Monitor memory usage, especially when adjusting the multi-process settings.
Links: Evaluation Platform: https://challenge.aiops.cn/home/competition/1963605668447416345
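Before running run.sh, it can help to confirm that both API variables are present in src/.env. The sketch below is a minimal, standard-library check under stated assumptions: it treats the file as plain KEY=VALUE lines, and the helper names parse_env_file and missing_keys are hypothetical. How the project itself loads the file is not described in the summary above.

```python
import tempfile
from pathlib import Path

# Variable names taken from the prerequisites list above.
REQUIRED_KEYS = ("KEJIYUN_API_KEY", "KEJIYUN_API_BASE")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Demonstration with a throwaway file and a placeholder key (not a real secret).
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# placeholder values\nKEJIYUN_API_KEY=sk-placeholder\n", encoding="utf-8"
    )
    parsed = parse_env_file(env_path)
    absent = missing_keys(parsed)

print("missing:", absent)
```

Pointing parse_env_file at the real src/.env and checking that missing_keys returns an empty list gives a quick sanity check before starting a long run.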

Highlighted Details

Achieved Top 5 in the 2025 CCF International AIOps Challenge (Track 1).
Processes Log, Trace, and Metric data for fault analysis.
Modular design with function encapsulation for scalability and independence.
Outputs structured root cause analysis including component, reason, and a detailed reasoning_trace.
Employs parallel computing via a multi-process architecture.
Includes fault-tolerance mechanisms covering retries, exception isolation, and missing-data scenarios.
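The structured output and the fault-tolerance items listed above can be pictured with a small sketch. It assumes a per-case analysis function that may fail, wraps it in retries, and isolates any remaining exception so one bad case does not stop a batch. The helper analyze_with_retry, the stand-in analysis functions, the case IDs, and the shape of reasoning_trace (a list of strings) are all hypothetical. Only the three field names come from the summary.

```python
import json
import time


def analyze_with_retry(case_id, analyze, max_attempts=3, delay_s=0.0):
    """Call analyze(case_id), retrying on failure; never raises to the caller."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = analyze(case_id)
            # Keep only the three fields named in the summary.
            return {
                "component": result["component"],
                "reason": result["reason"],
                "reasoning_trace": result["reasoning_trace"],
            }
        except Exception as exc:  # isolate the failure to this one case
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)
    # Fallback record so the batch can continue past a failed case.
    return {
        "component": None,
        "reason": f"analysis failed after {max_attempts} attempts: {last_error}",
        "reasoning_trace": [],
    }


calls = {"n": 0}


def flaky_analyze(case_id):
    """Hypothetical stand-in for an LLM call: fails once, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated API timeout")
    return {
        "component": "service-a",  # made-up component name
        "reason": "high latency",
        "reasoning_trace": ["trace durations anomalous", "metrics confirm"],
    }


def always_fails(case_id):
    """Hypothetical stand-in for a case whose input data is missing."""
    raise ValueError("simulated missing data")


results = {
    "case-1": analyze_with_retry("case-1", flaky_analyze),
    "case-2": analyze_with_retry("case-2", always_fails),
}
print(json.dumps(results, indent=2))
```

Returning a fallback record instead of raising keeps every case represented in the output, which makes failed cases easy to spot afterwards.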

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or detailed maintenance information (e.g., recent commit activity, roadmap) are provided in the README. The authors are listed as Tang, Pan; Tang, Shixiang; Pu, Huanqi; Miao, Zhiqing; and Wang, Zhixing.

Licensing & Compatibility

The license type and any associated compatibility notes for commercial use or closed-source linking are not explicitly stated in the provided README content.

Limitations & Caveats

The project acknowledges that the best root cause localization accuracy requires deeper integration of domain-specific operations and maintenance (O&M) knowledge, such as enterprise experience, key business indicators, and standardized fault-diagnosis SOPs. Without such O&M resources, the current solution's accuracy may be limited. The system also depends on access to an external LLM API.
