quokka by marsupialtail

Distributed query engine for time series data

Created 4 years ago

1,188 stars

Top 32.7% on SourcePulse

9 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Chang She

Cofounder of LanceDB

Samuel Colvin

Founder and Author of Pydantic

David Cournapeau

Author of scikit-learn

and 5 more!

Project Summary

Quokka is a Python-native, push-based distributed query engine for time series analytics and complex event processing on large datasets. It targets data engineers and researchers working with time series data, and its authors report speedups over engines such as Spark on specific workloads, particularly those involving windowing, joins, and custom stateful computations.

How It Works

Quokka uses a push-based execution model: data partitions are processed serially as they become available, which lets shuffles and I/O be pipelined for better performance. It integrates several high-performance libraries: DuckDB and Polars for relational algebra kernels, Ray for distributed task scheduling, Arrow for data interchange, and Redis for lineage logging. This architecture supports complex time series operations such as asof/range joins and pattern recognition, and the Python-native implementation simplifies extensibility and UDF integration.
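
To make the push-based idea concrete, here is a minimal single-process sketch in plain Python. It is not Quokka's implementation or API: the `Filter`, `SumBy`, and `run` names and the row-dict partition layout are hypothetical, chosen only to show operators handling each partition as soon as it arrives instead of waiting for the full input.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Partition = List[Row]  # hypothetical layout: a partition is a list of row dicts


class SumBy:
    """Stateful sink: keeps a running sum per key across pushed partitions."""

    def __init__(self, key: str, value: str) -> None:
        self.key = key
        self.value = value
        self.totals: Dict[object, float] = {}
        self.done = False

    def push(self, partition: Partition) -> None:
        # State is updated incrementally, one partition at a time.
        for row in partition:
            k = row[self.key]
            self.totals[k] = self.totals.get(k, 0) + row[self.value]

    def finish(self) -> None:
        self.done = True


class Filter:
    """Stateless operator: filters a partition and pushes the rest downstream."""

    def __init__(self, predicate: Callable[[Row], bool], downstream) -> None:
        self.predicate = predicate
        self.downstream = downstream

    def push(self, partition: Partition) -> None:
        kept = [row for row in partition if self.predicate(row)]
        if kept:  # skip empty partitions entirely
            self.downstream.push(kept)

    def finish(self) -> None:
        self.downstream.finish()


def run(source: Iterable[Partition], head) -> None:
    """Drive the pipeline: push each partition as soon as the source yields it."""
    for partition in source:
        head.push(partition)
    head.finish()


partitions = [
    [{"symbol": "A", "size": 10}, {"symbol": "B", "size": 5}],
    [{"symbol": "A", "size": 7}, {"symbol": "B", "size": 0}],
]
sink = SumBy("symbol", "size")
run(iter(partitions), Filter(lambda r: r["size"] > 0, sink))
print(sink.totals)
```

In a distributed engine the `push` calls would cross process boundaries and overlap with I/O; the sketch only shows the control flow, in which producers drive consumers.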

Quick Start & Requirements

Install: pip3 install pyquokka
Prerequisites: Redis >= 6.2 (installation instructions provided).
Documentation: https://marsupialtail.github.io/quokka/

Highlighted Details

Tick-level backtesting demonstrated on SIP trade streams.
Vector embedding analytics with support for formats like Lance.
Claimed to be several times faster than SparkSQL on TPC-H queries.
Supports complex time-series operations like windowing, asof/range joins, and pattern recognition.
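
For readers unfamiliar with asof joins, the following is a small standard-library sketch of a backward asof join: each trade is matched with the most recent quote at or before its timestamp. The `asof_join` function and the sample data are illustrative assumptions and do not reflect Quokka's own API or kernels.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def asof_join(
    trades: List[Tuple[int, float]],
    quotes: List[Tuple[int, float]],
) -> List[Tuple[int, float, Optional[float]]]:
    """Backward asof join on (timestamp, value) tuples.

    `quotes` must be sorted by timestamp. A trade with no earlier or
    simultaneous quote gets None.
    """
    quote_times = [t for t, _ in quotes]
    joined = []
    for trade_time, price in trades:
        # Index of the last quote whose timestamp is <= trade_time.
        idx = bisect_right(quote_times, trade_time) - 1
        quote = quotes[idx][1] if idx >= 0 else None
        joined.append((trade_time, price, quote))
    return joined


trades = [(1, 100.0), (5, 101.0), (9, 99.5)]
quotes = [(2, 100.2), (4, 100.8), (9, 99.4)]
result = asof_join(trades, quotes)
print(result)
```

The binary search keeps each lookup logarithmic in the number of quotes. A range join would instead collect every quote inside a time window around each trade.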

Maintenance & Community

The README describes active development and acknowledges contributors.
Discord channel available for questions and discussion.
Encourages users to reach out and raise GitHub issues.

Licensing & Compatibility

License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

Quokka is not a direct replacement for SparkSQL, as it does not yet parse SQL directly, though this is on the roadmap. The project encourages users to engage with the developers before deploying for critical use cases.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

3 stars in the last 30 days