rrcf  by kLabUM

Anomaly detection library using the Robust Random Cut Forest algorithm

created 6 years ago
514 stars

Top 61.7% on sourcepulse

Project Summary

This library provides a Python implementation of the Robust Random Cut Forest (RRCF) algorithm for anomaly detection in streaming data. It is aimed at researchers and practitioners who work with time-series data and need to identify outliers efficiently, even in high-dimensional or noisy datasets. The RRCF algorithm produces an anomaly score with a clear statistical meaning and copes with characteristics that often defeat other methods, such as irrelevant dimensions and duplicate or near-duplicate points that can mask outliers.

How It Works

The core of the library is the RCTree class, which builds robust random cut trees. Each tree is a binary search tree that recursively partitions the data points. Anomalies are scored by a point's "collusive displacement" (CoDisp), which measures how much the tree's model complexity changes when the point is included. A higher CoDisp indicates a higher likelihood that the point is an outlier. For both batch and streaming detection, the library builds an ensemble (forest) of these trees.
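
To make the idea of recursive random partitioning concrete, here is a simplified, standard-library-only sketch. It is not the library's implementation and does not compute CoDisp. It builds random cut trees using the cut rule from Guha et al. (dimension chosen in proportion to its range, cut value chosen uniformly within that range) and uses average leaf depth as a rough proxy for how easily a point is isolated. The function names random_cut and leaf_depths, the synthetic data, and the number of trees are all assumptions made for illustration.

```python
import random


def random_cut(points, rng):
    """Pick a cut dimension with probability proportional to its range,
    then a cut value uniformly within that range."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    spans = [hi - lo for lo, hi in zip(lows, highs)]
    dim = rng.choices(range(dims), weights=spans)[0]
    cut = rng.uniform(lows[dim], highs[dim])
    return dim, cut


def leaf_depths(points, rng, depth=0, out=None):
    """Recursively partition distinct points; return {point: leaf depth}."""
    if out is None:
        out = {}
    if len(points) == 1:
        out[points[0]] = depth
        return out
    dim, cut = random_cut(points, rng)
    left = [p for p in points if p[dim] <= cut]
    right = [p for p in points if p[dim] > cut]
    if not right:
        # Degenerate cut exactly at the maximum: draw a new cut.
        return leaf_depths(points, rng, depth, out)
    leaf_depths(left, rng, depth + 1, out)
    leaf_depths(right, rng, depth + 1, out)
    return out


rng = random.Random(0)
# Illustrative data: a Gaussian cluster plus one distant point.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
outlier = (8.0, 8.0)
points = cluster + [outlier]

num_trees = 200  # assumed ensemble size
total = {p: 0 for p in points}
for _ in range(num_trees):
    for p, d in leaf_depths(points, rng).items():
        total[p] += d

outlier_depth = total[outlier] / num_trees
cluster_depth = sum(total[p] for p in cluster) / (num_trees * len(cluster))
print(f"mean leaf depth, outlier: {outlier_depth:.2f}")
print(f"mean leaf depth, cluster: {cluster_depth:.2f}")
```

The distant point tends to be separated by one of the first few cuts, so it sits much closer to the root than the clustered points. CoDisp builds on the same intuition but measures displacement rather than depth.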

Quick Start & Requirements

  • Installation: pip install rrcf
  • Dependencies: numpy (>= 1.15). Optional: pandas, scipy, scikit-learn, matplotlib for examples.
  • Documentation: hosted online and linked from the project README.

Highlighted Details

  • Implements the RRCF algorithm by Guha et al. (2016).
  • Supports streaming data with efficient point insertion and removal.
  • Provides methods for calculating anomaly scores (CoDisp) and feature importance.
  • Includes examples for batch and streaming anomaly detection, and feature importance estimation.

Maintenance & Community

  • A paper describing the project was published in the Journal of Open Source Software in 2019.
  • Contributions are welcomed via pull requests to the dev branch.
  • Issues can be raised for bugs or feature requests.

Licensing & Compatibility

  • License: not stated in the README, which is a significant omission for due diligence.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Because the README does not state a license, commercial use and derivative works are difficult to assess. The README covers contributing guidelines and testing, but gives no clear indication of active maintenance beyond the initial publication.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 8 stars in the last 90 days
