quokka by marsupialtail

Distributed query engine for time series data

created 3 years ago
1,180 stars

Top 33.7% on sourcepulse

Project Summary

Quokka is a Python-native, push-based distributed query engine designed for high-performance time-series analytics and complex event processing on large datasets. It targets data engineers and researchers working with time-series data, and it reports significant speedups over traditional engines such as Spark on specific workloads, particularly those involving windowing, joins, and custom stateful computations.

How It Works

Quokka leverages a push-based execution model, allowing data partitions to be processed serially as they become available, enabling pipelining of shuffles and I/O for performance gains. It integrates multiple high-performance libraries: DuckDB and Polars for relational algebra kernels, Ray for distributed task scheduling, Arrow for efficient data interchange, and Redis for lineage logging. This architecture allows for complex time-series operations like asof/range joins and pattern recognition, while its Python-native implementation simplifies extensibility and UDF integration.
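To make the push-based model concrete, here is a minimal single-process sketch in plain Python. It is not Quokka's API: the class names (`Filter`, `SumBySymbol`), the `run_source` helper, and the `(symbol, price)` row layout are all hypothetical, chosen only to show how a source can push each partition downstream as soon as it is available so that stateful operators update incrementally instead of waiting for the full input.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical row layout for illustration: (symbol, price).
Row = Tuple[str, float]


class SumBySymbol:
    """Stateful sink: keeps a running total per symbol across pushed partitions."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)
        self.partitions_seen = 0

    def push(self, partition: List[Row]) -> None:
        # State is updated as each partition arrives; nothing is buffered.
        for symbol, price in partition:
            self.totals[symbol] += price
        self.partitions_seen += 1


class Filter:
    """Stateless operator: forwards the rows of a partition that pass a predicate."""

    def __init__(self, predicate: Callable[[Row], bool], downstream) -> None:
        self.predicate = predicate
        self.downstream = downstream

    def push(self, partition: List[Row]) -> None:
        kept = [row for row in partition if self.predicate(row)]
        if kept:
            # Push immediately so the downstream operator can start working.
            self.downstream.push(kept)


def run_source(partitions: Iterable[List[Row]], downstream) -> None:
    """Source: pushes each partition downstream as soon as it is produced."""
    for partition in partitions:
        downstream.push(partition)


if __name__ == "__main__":
    sink = SumBySymbol()
    pipeline = Filter(lambda row: row[1] > 0, sink)
    run_source(
        [[("AAPL", 10.0), ("MSFT", -1.0)], [("AAPL", 5.0)], [("MSFT", 20.0)]],
        pipeline,
    )
    print(dict(sink.totals))
```

In a distributed engine the `push` calls would cross process or machine boundaries, which is where pipelining shuffles and I/O would pay off; this sketch only shows the control flow on one machine.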

Quick Start & Requirements

Highlighted Details

  • Tick-level backtesting demonstrated on SIP trade streams.
  • Vector embedding analytics with support for formats like Lance.
  • Claimed to be several times faster than SparkSQL on TPC-H queries.
  • Supports complex time-series operations like windowing, asof/range joins, and pattern recognition.
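
For readers unfamiliar with asof joins, the following is a small plain-Python sketch of the backward variant: each trade is matched with the most recent quote at or before its timestamp. It does not use Quokka's API; the function name `asof_join_backward` and the `(timestamp, price)` tuples are assumptions made for illustration, and the quotes are assumed to be sorted by timestamp.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical record layouts for illustration.
Trade = Tuple[int, float]  # (timestamp, trade_price)
Quote = Tuple[int, float]  # (timestamp, quote_price)


def asof_join_backward(
    trades: List[Trade], quotes: List[Quote]
) -> List[Tuple[int, float, Optional[float]]]:
    """Attach to each trade the latest quote at or before the trade's timestamp.

    Assumes `quotes` is sorted by timestamp. Trades with no earlier quote
    get None in the quote column.
    """
    quote_times = [ts for ts, _ in quotes]
    joined = []
    for ts, trade_price in trades:
        # Index of the last quote whose timestamp is <= ts, or -1 if none.
        idx = bisect_right(quote_times, ts) - 1
        quote_price = quotes[idx][1] if idx >= 0 else None
        joined.append((ts, trade_price, quote_price))
    return joined


if __name__ == "__main__":
    trades = [(1, 100.0), (5, 101.0), (10, 102.0)]
    quotes = [(2, 99.5), (5, 100.5), (9, 101.5)]
    for row in asof_join_backward(trades, quotes):
        print(row)
```

A range join differs in that each left row may match every right row whose timestamp falls within a window, rather than only the single nearest one.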

Maintenance & Community

  • Active development with contributions acknowledged.
  • Discord channel available for questions and discussion.
  • Encourages users to reach out and raise GitHub issues.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

Quokka is not a direct replacement for SparkSQL, as it does not yet parse SQL directly, though this is on the roadmap. The project encourages users to engage with the developers before deploying for critical use cases.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 14 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

bytewax by bytewax

0.3%
2k
Python framework for stateful stream processing
created 3 years ago
updated 4 months ago