deltacat by ray-project

Portable multimodal lakehouse for exabyte-scale data

Created 4 years ago
269 stars

Top 95.6% on SourcePulse

Summary

DeltaCAT addresses the challenge of building and managing scalable, ACID-compliant multimodal data lakes. It empowers engineers and researchers to handle exabyte-scale production data for ML and analytics workloads, offering features like transactional data management, time travel, and zero-copy processing of diverse data types (images, audio, video, text). DeltaCAT provides a robust foundation for reliable and efficient data lake operations, running seamlessly from local development environments to cloud-scale deployments.

How It Works

DeltaCAT is built on Ray, Apache Arrow, and Daft, and integrates three layers: Catalog, Compute, and Storage. The Catalog layer provides Pythonic APIs for data discovery and management; the Compute layer automates dataset optimization and distributed data management; and the Storage layer defines a portable, multimodal data lake format that works on any filesystem, with no external catalog service or lock manager required. This architecture enables zero-copy schema evolution and multimodal file processing, so data can be managed efficiently across formats and scales.
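
As a rough illustration of the Catalog layer's Pythonic flavor, here is a minimal sketch of a write/read round trip. The dc.init/dc.put/dc.get names follow the style of the project's examples, but treat the exact names and signatures as assumptions rather than a definitive API reference:

```python
# Minimal sketch of a catalog round trip; dc.init/dc.put/dc.get are
# assumed names in the style of DeltaCAT's examples, not a verified API.
import deltacat as dc
import pandas as pd

# Initialize DeltaCAT (Ray is started automatically under the hood).
dc.init()

df = pd.DataFrame({"id": [1, 2, 3], "label": ["cat", "dog", "bird"]})

# Write the dataframe to a table in a namespace.
dc.put("demo_namespace.demo_table", data=df)

# Read it back for downstream analytics or ML workloads.
round_trip = dc.get("demo_namespace.demo_table")
```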

Quick Start & Requirements

  • Install: pip install deltacat
  • Prerequisites: Python and Ray (initialized automatically). Runs locally or in any cloud environment where Ray is supported; any PyArrow-compatible filesystem can serve as storage (see the sketch after this list).
  • Setup: Local setup is straightforward via pip. Cloud deployment leverages Ray's distributed capabilities.
  • Docs: The README provides extensive examples and API documentation.
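
Because storage is addressed through PyArrow filesystems, switching from local disk to cloud object storage amounts to swapping the filesystem handle. A self-contained sketch using only pyarrow.fs (the S3 region below is a placeholder):

```python
# Demonstrates the "any PyArrow-compatible filesystem" point with
# pyarrow.fs alone; the S3 region is a placeholder value.
from pyarrow import fs

# Local development: plain local filesystem.
local_fs = fs.LocalFileSystem()

# Cloud deployment: S3 (fs.GcsFileSystem works the same way).
s3_fs = fs.S3FileSystem(region="us-east-1")

# Both handles expose the same interface, so code written against one
# runs unchanged against the other.
selector = fs.FileSelector("/tmp", recursive=False)
for info in local_fs.get_file_info(selector):
    print(info.path, info.type)
```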

Highlighted Details

  • ACID Transactions & Time Travel: Enables data lake-level transactions and point-in-time queries across multiple tables and namespaces.
  • Zero-Copy Multimodal Processing: Efficiently handles images, audio, video, and text without data duplication.
  • Schema Evolution: Supports zero-copy schema evolution for adapting to changing data structures.
  • Universal Filesystem Support: Operates on any PyArrow-compatible filesystem (local, S3, GCS, Azure Blob Storage), enhancing portability.
  • Broad Data Format Compatibility: Reads and writes data with Pandas, NumPy, Polars, PyArrow, Ray Data, and Daft (see the Arrow interchange sketch after this list).
  • Namespace Organization: Allows logical grouping of tables within catalogs for better management.
  • Multi-Table Transactions: Ensures atomicity for operations spanning multiple datasets.
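
Since these libraries all speak Arrow, a table materialized once can be handed across them via their public from-Arrow constructors, usually without copying the underlying buffers. A small interchange sketch (independent of DeltaCAT's own read path):

```python
# Arrow interchange across the dataframe libraries listed above; this
# uses each library's public constructor, not DeltaCAT's read path.
import pyarrow as pa
import polars as pl
import daft
import ray

table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})

pandas_df = table.to_pandas()        # Pandas view (may copy some types)
polars_df = pl.from_arrow(table)     # Polars over the same Arrow buffers
daft_df = daft.from_arrow(table)     # Daft dataframe from the Arrow table
ray_ds = ray.data.from_arrow(table)  # Ray Dataset (starts Ray if needed)
```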

Maintenance & Community

No specific community links (e.g., Discord, Slack) or details on maintenance frequency were found in the provided README.

Licensing & Compatibility

The README does not explicitly state the project's license. This is a critical omission for assessing commercial compatibility and usage restrictions.

Limitations & Caveats

Local (laptop) catalogs are recommended only for testing and experimentation, since system clock drift can affect transaction ordering; production deployments require filesystems with strong read-after-write consistency guarantees. The Sync component, which synchronizes with other table formats, is noted as still in development.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 1
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Alexander Wettig (coauthor of SWE-bench and SWE-agent), and 5 more.

data-juicer by datajuicer

Top 0.6% on SourcePulse · 6k stars
Data-Juicer: Data processing system for foundation models
Created 2 years ago · Updated 22 hours ago
Starred by Clement Delangue (cofounder of Hugging Face), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 26 more.

datasets by huggingface

Top 0.1% on SourcePulse · 21k stars
Access and process large AI datasets efficiently
Created 6 years ago · Updated 1 day ago