deltacat by ray-project

Portable multimodal lakehouse for exabyte-scale data

Created 4 years ago
269 stars

Top 95.6% on SourcePulse

Summary

DeltaCAT addresses the challenge of building and managing scalable, ACID-compliant multimodal data lakes. It empowers engineers and researchers to handle exabyte-scale production data for ML and analytics workloads, offering features like transactional data management, time travel, and zero-copy processing of diverse data types (images, audio, video, text). DeltaCAT provides a robust foundation for reliable and efficient data lake operations, running seamlessly from local development environments to cloud-scale deployments.

How It Works

DeltaCAT is built on Ray, Apache Arrow, and Daft, and integrates three layers: Catalog, Compute, and Storage. The Catalog layer provides Pythonic APIs for data discovery and management; the Compute layer automates dataset optimization and distributed data management; and the Storage layer defines a portable, multimodal data lake format that works on any filesystem, with no external catalog service or lock manager required. This architecture enables zero-copy schema evolution and multimodal file processing, so data can be managed efficiently across formats and scales.
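
As a rough illustration of the Catalog layer's Pythonic flavor, here is a minimal sketch of a write/read round trip. The dc.init/dc.put/dc.get names follow the style of the project's examples, but treat the exact names and signatures as assumptions rather than a definitive API reference:

```python
# Minimal sketch of a catalog round trip; dc.init/dc.put/dc.get are
# assumed names in the style of DeltaCAT's examples, not a verified API.
import deltacat as dc
import pandas as pd

# Initialize DeltaCAT (Ray is started automatically under the hood).
dc.init()

df = pd.DataFrame({"id": [1, 2, 3], "label": ["cat", "dog", "bird"]})

# Write the dataframe to a table in a namespace.
dc.put("demo_namespace.demo_table", data=df)

# Read it back for downstream analytics or ML workloads.
round_trip = dc.get("demo_namespace.demo_table")
```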

Quick Start & Requirements

  • Install: pip install deltacat
  • Prerequisites: Python and Ray (initialized automatically). Runs locally or in any cloud environment where Ray is supported; any PyArrow-compatible filesystem can serve as storage (see the sketch after this list).
  • Setup: Local setup is straightforward via pip. Cloud deployment leverages Ray's distributed capabilities.
  • Docs: The README provides extensive examples and API documentation.
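
Because storage is addressed through PyArrow filesystems, switching from local disk to cloud object storage amounts to swapping the filesystem handle. A self-contained sketch using only pyarrow.fs (the S3 region below is a placeholder):

```python
# Demonstrates the "any PyArrow-compatible filesystem" point with
# pyarrow.fs alone; the S3 region is a placeholder value.
from pyarrow import fs

# Local development: plain local filesystem.
local_fs = fs.LocalFileSystem()

# Cloud deployment: S3 (fs.GcsFileSystem works the same way).
s3_fs = fs.S3FileSystem(region="us-east-1")

# Both handles expose the same interface, so code written against one
# runs unchanged against the other.
selector = fs.FileSelector("/tmp", recursive=False)
for info in local_fs.get_file_info(selector):
    print(info.path, info.type)
```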

Highlighted Details

  • ACID Transactions & Time Travel: Enables data lake-level transactions and point-in-time queries across multiple tables and namespaces.
  • Zero-Copy Multimodal Processing: Efficiently handles images, audio, video, and text without data duplication.
  • Schema Evolution: Supports zero-copy schema evolution for adapting to changing data structures.
  • Universal Filesystem Support: Operates on any PyArrow-compatible filesystem (local, S3, GCS, Azure Blob Storage), enhancing portability.
  • Broad Data Format Compatibility: Reads and writes data with Pandas, NumPy, Polars, PyArrow, Ray Data, and Daft (see the Arrow interchange sketch after this list).
  • Namespace Organization: Allows logical grouping of tables within catalogs for better management.
  • Multi-Table Transactions: Ensures atomicity for operations spanning multiple datasets.
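
Since these libraries all speak Arrow, a table materialized once can be handed across them via their public from-Arrow constructors, usually without copying the underlying buffers. A small interchange sketch (independent of DeltaCAT's own read path):

```python
# Arrow interchange across the dataframe libraries listed above; this
# uses each library's public constructor, not DeltaCAT's read path.
import pyarrow as pa
import polars as pl
import daft
import ray

table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})

pandas_df = table.to_pandas()        # Pandas view (may copy some types)
polars_df = pl.from_arrow(table)     # Polars over the same Arrow buffers
daft_df = daft.from_arrow(table)     # Daft dataframe from the Arrow table
ray_ds = ray.data.from_arrow(table)  # Ray Dataset (starts Ray if needed)
```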

Maintenance & Community

No specific community links (e.g., Discord, Slack) or details on maintenance frequency were found in the provided README.

Licensing & Compatibility

The README does not explicitly state the project's license. This is a critical omission for assessing commercial compatibility and usage restrictions.

Limitations & Caveats

Local (laptop) catalogs are recommended only for testing and experimentation, since system clock drift can affect transaction ordering; production deployments require filesystems with strong read-after-write consistency guarantees. The Sync component, which synchronizes with other table formats, is noted as still in development.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 1
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Alexander Wettig (coauthor of SWE-bench and SWE-agent), and 5 more.

data-juicer by datajuicer

Top 0.6% on SourcePulse · 6k stars
Data-Juicer: Data processing system for foundation models
Created 2 years ago · Updated 22 hours ago
Starred by Clement Delangue (cofounder of Hugging Face), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 26 more.

datasets by huggingface

Top 0.1% on SourcePulse · 21k stars
Access and process large AI datasets efficiently
Created 6 years ago · Updated 1 day ago