opensre by Tracer-Cloud

AI SRE agents for automated incident investigation

Created 2 months ago
509 stars

Top 61.2% on SourcePulse

Project Summary

Open SRE is an open-source framework for building AI-powered Site Reliability Engineering (SRE) agents. It automates incident investigation and root cause analysis by correlating evidence scattered across separate systems. It targets production data engineering teams that run complex data platforms and want to reduce Mean Time To Resolution (MTTR) by automating manual investigation work.

How It Works

Tracer operates by ingesting alerts from monitoring systems, then assembling context from logs, metrics, configurations, and dependencies. It frames potential failure modes, executes investigation queries across connected systems in parallel, and evaluates hypotheses based on collected evidence. The framework delivers evidence-backed root cause reports, enabling structured incident investigation and cross-system failure correlation. Its design prioritizes deterministic investigations, parallel hypothesis testing, and production-first robustness.
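The flow described above (assemble context, test several hypotheses in parallel, rank them by evidence) can be illustrated with a minimal Python sketch. This is not the project's code: every name here (Hypothesis, assemble_context, check_oom, check_disk, investigate) and all sample data are hypothetical, and real checks would query connected systems rather than an in-memory dictionary.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    """A candidate failure mode plus a query that gathers evidence for it."""
    name: str
    query: Callable[[Dict], List[str]]


def assemble_context(alert: Dict) -> Dict:
    # Hypothetical stand-in for pulling logs, metrics, configs and dependencies.
    return {
        "alert": alert,
        "logs": ["OOMKilled: worker-3"],
        "metrics": {"disk_used_pct": 41},
    }


def check_oom(ctx: Dict) -> List[str]:
    # Evidence: any log line mentioning an out-of-memory kill.
    return [line for line in ctx["logs"] if "OOMKilled" in line]


def check_disk(ctx: Dict) -> List[str]:
    # Evidence: disk usage above an illustrative 90% threshold.
    used = ctx["metrics"]["disk_used_pct"]
    return [f"disk at {used}%"] if used > 90 else []


def investigate(alert: Dict, hypotheses: List[Hypothesis]) -> List[Dict]:
    ctx = assemble_context(alert)
    # Run every hypothesis query concurrently; pool.map preserves input order.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda h: h.query(ctx), hypotheses))
    # Rank by amount of evidence, breaking ties by name so output is deterministic.
    ranked = sorted(zip(hypotheses, results), key=lambda p: (-len(p[1]), p[0].name))
    return [{"hypothesis": h.name, "evidence": ev} for h, ev in ranked]


report = investigate(
    {"name": "pipeline_failed"},
    [Hypothesis("disk_full", check_disk), Hypothesis("out_of_memory", check_oom)],
)
for entry in report:
    print(entry)
```

Sorting on evidence count with a name tie-break keeps the ranking reproducible regardless of which query finishes first, which is one simple way to read the "deterministic investigations" goal.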

Quick Start & Requirements

To install, clone the repository, change into its directory, and run make install followed by make install-hooks. To configure, copy the example environment file with cp .env.example .env, then run opensre onboard to set up a local LLM provider and connect services such as Grafana, Datadog, Slack, AWS, GitHub MCP, and Sentry. Several demos are available, including a Local Grafana RCA Demo (run with make local-grafana-live) and a Bundled Local RCA Demo that gives a quick start without Docker. A Full Local Setup Guide covers connecting custom systems and running the LangGraph dev UI.

Highlighted Details

  • Extensive Integrations: Supports a wide array of platforms including Apache Airflow, Kafka, Spark, Grafana, Datadog, CloudWatch, Sentry, Kubernetes, AWS, GCP, Azure, GitHub, Slack, and PagerDuty.
  • Automated RCA: Delivers root cause reports directly to Slack out of the box and can be extended to other platforms such as PagerDuty or OpsGenie.
  • Design Principles: Emphasizes deterministic investigations, evidence-backed conclusions, parallel hypothesis testing, production-first design, and fully auditable workflows.

Maintenance & Community

The project welcomes contributors, with details available in CONTRIBUTING.md. Community resources and documentation links include Slack, Getting Started guides, Tracer Agent information, Docs, FAQ, and Security details.

Licensing & Compatibility

The project is licensed under the Apache License 2.0. This license is permissive and generally compatible with commercial use and linking within closed-source projects.

Limitations & Caveats

Tracer interacts with production systems, necessitating careful security configurations. The project strongly recommends using read-only credentials, restricting network exposure, logging all investigations, and reviewing reports before automated remediation actions are taken.
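Two of these recommendations, logging every investigation and requiring review before remediation, can be sketched in a few lines of Python. This is an illustrative pattern only, not the project's API: AUDIT_LOG, audit, run_investigation, and remediate are hypothetical names, and the in-memory list stands in for whatever append-only log sink a real deployment would use.

```python
import json
import time
from typing import Dict, List

# Hypothetical audit sink; a real setup would write to an append-only store.
AUDIT_LOG: List[str] = []


def audit(event: str, **details) -> None:
    # Record each step as one JSON line with a timestamp.
    AUDIT_LOG.append(json.dumps({"ts": time.time(), "event": event, **details}, sort_keys=True))


def run_investigation(alert_name: str) -> Dict:
    audit("investigation_started", alert=alert_name)
    # Placeholder report; a real investigation would use read-only queries.
    report = {"alert": alert_name, "root_cause": "example only", "proposed_action": "restart worker"}
    audit("investigation_finished", alert=alert_name)
    return report


def remediate(report: Dict, approved_by: str = "") -> bool:
    # Refuse to act unless a named human reviewer has approved the report.
    if not approved_by:
        audit("remediation_blocked", alert=report["alert"], reason="no reviewer")
        return False
    audit("remediation_approved", alert=report["alert"], reviewer=approved_by)
    return True


report = run_investigation("pipeline_failed")
print(remediate(report))                       # blocked: nobody reviewed the report
print(remediate(report, approved_by="alice"))  # allowed after review
```

Keeping the approval check inside the remediation function, rather than in the caller, means no code path can act on a report without leaving an audit entry.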

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 197
  • Issues (30d): 191

Star History

471 stars in the last 30 days
