sail by lakehq

Computation framework unifying data processing and AI workloads

Created 2 years ago

1,110 stars

Top 34.4% on SourcePulse

3 Experts Love This Project

Emil Ernerfeldt

Cofounder of Rerun

Jeff Hammerbacher

Cofounder of Cloudera

Wes McKinney

Author of Pandas

Project Summary

Sail is a computation framework designed to unify batch processing, stream processing, and AI workloads, offering a drop-in replacement for Spark SQL and the Spark DataFrame API. It targets data engineers and AI practitioners seeking to streamline complex data pipelines and improve performance.

How It Works

Sail runs as a Spark Connect server, so existing PySpark applications can connect to a Sail backend without changes to their application logic; only the session's remote URL needs to point at Sail. This keeps the familiar Spark API while adding performance optimizations and potentially reducing infrastructure costs.
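As a rough illustration of this design, the sketch below keeps the DataFrame logic independent of the backend and picks the Spark Connect endpoint from configuration. The helper names resolve_remote, word_lengths, and main are hypothetical, the default URL assumes a Sail server listening locally on port 50051, and SPARK_REMOTE is the environment variable that PySpark's Spark Connect client conventionally reads. None of this is taken from the Sail README.

```python
import os


def resolve_remote(env=None, default="sc://localhost:50051"):
    """Return the Spark Connect URL from SPARK_REMOTE, or a local default.

    The default assumes a Sail server on localhost:50051 (an assumption).
    """
    env = os.environ if env is None else env
    return env.get("SPARK_REMOTE", default)


def word_lengths(spark):
    """Ordinary PySpark DataFrame code; nothing here is Sail-specific."""
    from pyspark.sql import functions as F  # imported lazily; needs pyspark

    df = spark.createDataFrame([("sail",), ("spark",)], ["word"])
    return df.select("word", F.length("word").alias("n")).collect()


def main():
    """Open a Spark Connect session against the resolved endpoint."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(resolve_remote()).getOrCreate()
    try:
        for row in word_lengths(spark):
            print(row["word"], row["n"])
    finally:
        spark.stop()
```

Calling main() requires a reachable Spark Connect endpoint. The point of the split is that word_lengths would not need to change if the endpoint moved from a Spark cluster to a Sail server.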

Quick Start & Requirements

Install: pip install "pysail[spark]"
Prerequisites: Python. The pip package is enough to get started; installing from source is recommended when optimized performance matters.
Getting Started: Connect to a running Sail server using SparkSession.builder.remote("sc://localhost:50051").getOrCreate().
Documentation: https://docs.lake.sail.io/
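The quick-start steps above can be sketched as a small script that first checks whether anything is listening on the server port and then opens a session. The function names are hypothetical, the host and port mirror the example URL above, and the script assumes a Sail server has already been started separately.

```python
import socket


def server_reachable(host="localhost", port=50051, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def remote_url(host="localhost", port=50051):
    """Build a Spark Connect URL of the form sc://host:port."""
    return f"sc://{host}:{port}"


def run_query(host="localhost", port=50051):
    """Connect to the server and run a trivial SQL query."""
    if not server_reachable(host, port):
        raise RuntimeError(f"no server listening on {host}:{port}")
    # PySpark is expected to come from the pysail[spark] extra (an assumption).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(remote_url(host, port)).getOrCreate()
    try:
        return spark.sql("SELECT 1 + 1 AS two").collect()[0]["two"]
    finally:
        spark.stop()
```

The reachability check only confirms that the port accepts TCP connections. It does not verify that the listener is actually a Sail server, so a failure inside run_query may still come from the Spark Connect handshake.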

Highlighted Details

The project claims up to a 4x speedup and a 94% cost reduction compared to standard Spark in its benchmarks.
Supports both single-host and distributed deployments, including Kubernetes.
Offers an MCP server for integrating Spark data analytics with LLM agents.

Maintenance & Community

The project is under active development and welcomes contributions via GitHub issues, discussions, and pull requests.
Enterprise support is available from LakeSail.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a replacement for Spark SQL and the Spark DataFrame API, so applications still depend on the Spark client libraries (such as PySpark) to talk to it. The README does not detail which Spark versions are supported or how complete feature parity is.

Health Check

Last Commit

15 hours ago

Responsiveness

1 day

Star History

21 stars in the last 30 days