MicroRCA-Agent  by tangpan360

AI agent for microservice root cause analysis using multi-modal data

Created 10 months ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This project tackles root cause analysis for microservices by combining multi-modal data (logs, traces, and metrics) with Large Language Model (LLM) agents. It is aimed at engineers and researchers working with microservice architectures who need to identify and diagnose faults quickly. The output is structured and backed by a reasoning trace, with the goal of closing the loop from fault observation to root cause identification. The approach placed in the Top 5 of the 2025 CCF International AIOps Challenge.

How It Works

The system employs a modular architecture comprising five core components: data preprocessing, log fault extraction, trace fault detection, metric fault summarization, and multi-modal root cause analysis. Data interaction between these loosely coupled modules is managed through function encapsulation. Key techniques include the Drain3 algorithm for efficient log templating and data volume reduction, IsolationForest for detecting anomalies in trace durations, and LLM-based summarization for analyzing both application performance monitoring (APM) and infrastructure metrics. This multi-modal fusion, combined with LLM agents for reasoning, enables a comprehensive analysis across logs, traces, and metrics.
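To make the idea of log templating and volume reduction concrete, here is a minimal sketch. It is not Drain3 and not the project's code: it swaps in plain regex masking, whereas Drain3 builds a fixed-depth parse tree with a similarity threshold. The masking rules, the `to_template` and `summarize` helpers, and the sample log lines are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules (an assumption for this sketch). Drain3 itself
# clusters lines with a parse tree rather than relying on fixed regexes.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)?(?:ms|s)?\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable-looking tokens so similar lines collapse together."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line


def summarize(lines):
    """Count how many raw lines fall under each template."""
    return Counter(to_template(line) for line in lines)


# Made-up log lines, not taken from the challenge dataset.
logs = [
    "GET /cart took 120ms from 10.0.0.7",
    "GET /cart took 98ms from 10.0.0.9",
    "connection refused to 10.0.0.12 port 5432",
    "GET /cart took 3050ms from 10.0.0.7",
]

templates = summarize(logs)
for template, count in templates.most_common():
    print(count, template)
```

Four raw lines collapse into two templates here, which is the kind of reduction that keeps the text passed to an LLM small. A real pipeline would presumably also keep example lines or timestamps per template.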

Quick Start & Requirements

  • Installation: Install Git LFS, which the repository uses for large files. Then install the Python dependencies listed in src/requirements.txt, preferably inside a Conda environment.
  • Prerequisites:
    • Git LFS.
    • Python 3.10 or later (3.10 recommended).
    • DeepSeek LLM API credentials (KEJIYUN_API_KEY, KEJIYUN_API_BASE) set in src/.env.
    • Model selection in src/agent/llm_config.py (default: deepseek-chat).
  • Execution: After setup and configuration, run bash run.sh.
  • Resources: A high-performance machine is recommended because of the volume of data processed. Monitor memory usage, especially when adjusting the multi-process configuration.
  • Links: Evaluation Platform: https://challenge.aiops.cn/home/competition/1963605668447416345
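Before running the pipeline, it can help to confirm that both variables are present in src/.env. The following is a minimal sketch under stated assumptions: the README summary names the two variables but not how the project loads them, so `parse_env` and `missing_settings` are hypothetical helpers, and the sample values are placeholders rather than a real key or endpoint.

```python
# Variable names come from the project summary; everything else is assumed.
REQUIRED = ("KEJIYUN_API_KEY", "KEJIYUN_API_BASE")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values: dict) -> list:
    """Return the required names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


# Placeholder content standing in for src/.env.
sample = """
# placeholder values only
KEJIYUN_API_KEY=sk-placeholder
KEJIYUN_API_BASE="https://example.invalid/v1"
"""

settings = parse_env(sample)
print(missing_settings(settings))
```

In practice the text would come from the real file, for example `Path("src/.env").read_text()`, and a non-empty result would signal which setting still needs to be filled in.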

Highlighted Details

  • Achieved Top 5 in the 2025 CCF International AIOps Challenge (Track 1).
  • Processes Log, Trace, and Metric data for fault analysis.
  • Modular design with function encapsulation for scalability and independence.
  • Outputs structured root cause analysis including component, reason, and a detailed reasoning_trace.
  • Employs parallel computing via a multi-process architecture.
  • Includes fault-tolerance mechanisms: retries, exception isolation, and handling of missing data.
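The structured output and the retry and exception-isolation behaviour listed above can be sketched as follows. This is an illustration under assumptions, not the project's implementation. Only the field names component, reason, and reasoning_trace come from the summary. The `RootCause` class, the shape of each trace step, the `analyze_with_retry` helper, and the service name are hypothetical. The sketch also runs sequentially, whereas the project describes a multi-process architecture.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RootCause:
    component: str
    reason: str
    # The structure of each trace step is an assumption for this sketch.
    reasoning_trace: list = field(default_factory=list)


def analyze_with_retry(case_id, analyze, max_attempts=3):
    """Call analyze(case_id), retrying on errors; never raise to the caller."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return analyze(case_id)
        except Exception as exc:  # isolate the failure to this one case
            last_error = exc
    # Fallback record so one bad case does not abort the whole batch.
    return RootCause("unknown", f"analysis failed: {last_error}")


# Hypothetical analyzer that fails once (e.g. an API timeout), then succeeds.
calls = {"n": 0}


def flaky_analyze(case_id):
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("LLM API timed out")
    return RootCause(
        component="checkoutservice",
        reason="latency spike in trace spans",
        reasoning_trace=[{"step": 1, "observation": "p99 duration anomalous"}],
    )


result = analyze_with_retry("case-001", flaky_analyze)
print(json.dumps(asdict(result), indent=2))
```

Returning a fallback record instead of raising keeps a batch of cases moving when a single LLM call keeps failing, which is one plausible reading of "exception isolation".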

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or detailed maintenance information (e.g., recent commit activity, roadmap) are provided in the README. The authors are listed as Tang, Pan; Tang, Shixiang; Pu, Huanqi; Miao, Zhiqing; and Wang, Zhixing.

Licensing & Compatibility

The license type and any associated compatibility notes for commercial use or closed-source linking are not explicitly stated in the provided README content.

Limitations & Caveats

The project acknowledges that the best root cause localization accuracy requires deeper integration of domain-specific operations and maintenance (O&M) knowledge, such as enterprise experience, key business indicators, and standardized fault diagnosis SOPs. Without access to such O&M resources, the current solution's accuracy may be limited. The system also depends on an external LLM API being reachable.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
