deltacat by ray-project

Portable multimodal lakehouse for exabyte-scale data

Created 4 years ago
255 stars

Top 98.8% on SourcePulse

Summary

DeltaCAT addresses the challenge of building and managing scalable, ACID-compliant multimodal data lakes. It targets engineers and researchers who handle exabyte-scale production data for ML and analytics workloads, and offers transactional data management, time travel, and zero-copy processing of diverse data types (images, audio, video, text). It runs in both local development environments and cloud-scale deployments.

How It Works

DeltaCAT is built on Ray, Apache Arrow, and Daft, and integrates three layers: Catalog, Compute, and Storage. The Catalog layer provides Pythonic APIs for data discovery and management. The Compute layer automates dataset optimization and distributed data management. The Storage layer defines a portable, multimodal data lake format that works on any PyArrow-compatible filesystem, so no external catalog service or lock manager is needed. Together, these layers enable zero-copy schema evolution and multimodal file processing across formats and scales.
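
One general way a filesystem-only format can do without a lock manager is optimistic commits based on exclusive file creation: each writer tries to create the next numbered commit file, and only one creation can succeed. The sketch below illustrates that idea with the Python standard library. It is not DeltaCAT's actual implementation; the function names `try_commit` and `commit` and the `.commit` file layout are hypothetical, and whether a given remote filesystem offers an equivalent atomic primitive has to be checked separately.

```python
import os
import tempfile


def try_commit(log_dir: str, version: int, payload: str) -> bool:
    """Try to claim `version` by creating its commit file exclusively.

    Returns True if this writer created the file, False if another
    writer had already committed that version.
    """
    path = os.path.join(log_dir, f"{version:08d}.commit")
    try:
        # Mode "x" raises FileExistsError when the file already exists,
        # so at most one writer can create a given version.
        with open(path, "x", encoding="utf-8") as f:
            f.write(payload)
        return True
    except FileExistsError:
        return False


def commit(log_dir: str, payload: str) -> int:
    """Commit `payload` at the next free version, retrying on conflict."""
    while True:
        existing = [n for n in os.listdir(log_dir) if n.endswith(".commit")]
        version = len(existing)  # assumes versions are contiguous from 0
        if try_commit(log_dir, version, payload):
            return version


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        print(commit(log_dir, "add file-a"))
        print(commit(log_dir, "add file-b"))
        print(try_commit(log_dir, 0, "conflicting write"))
```

A writer that loses the race simply re-reads the log and retries at the next version, which is why no external lock is involved in this sketch.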

Quick Start & Requirements

  • Install: pip install deltacat
  • Prerequisites: Python and Ray (initialized automatically). Runs locally or in any cloud environment that supports Ray. Any PyArrow-compatible filesystem can be used.
  • Setup: Local setup needs only the pip install above. Cloud deployment relies on Ray's distributed capabilities.
  • Docs: The README provides extensive examples and API documentation.

Highlighted Details

  • ACID Transactions & Time Travel: Enables data lake-level transactions and point-in-time queries across multiple tables and namespaces.
  • Zero-Copy Multimodal Processing: Efficiently handles images, audio, video, and text without data duplication.
  • Schema Evolution: Supports zero-copy schema evolution for adapting to changing data structures.
  • Universal Filesystem Support: Operates on any PyArrow-compatible filesystem (local, S3, GCS, Azure Blob Storage), enhancing portability.
  • Broad Data Format Compatibility: Reads and writes data using Pandas, NumPy, Polars, PyArrow, Ray Data, and Daft.
  • Namespace Organization: Allows logical grouping of tables within catalogs for better management.
  • Multi-Table Transactions: Ensures atomicity for operations spanning multiple datasets.
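
Time travel and multi-table atomicity can both be pictured as properties of an append-only commit log: one log entry may carry changes to several tables, and a point-in-time read replays entries up to the requested time. The following is a minimal in-memory sketch of that concept, not DeltaCAT's API or storage format; the `Commit` and `CommitLog` classes and their methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    timestamp: int                    # logical commit time
    changes: Dict[str, List[dict]]    # table name -> rows appended


@dataclass
class CommitLog:
    commits: List[Commit] = field(default_factory=list)

    def commit(self, timestamp: int, changes: Dict[str, List[dict]]) -> None:
        """Record changes to one or more tables as a single log entry."""
        if self.commits and timestamp <= self.commits[-1].timestamp:
            raise ValueError("timestamps must be strictly increasing")
        self.commits.append(Commit(timestamp, changes))

    def snapshot(self, table: str, as_of: Optional[int] = None) -> List[dict]:
        """Return the rows of `table` as of `as_of` (latest if None)."""
        rows: List[dict] = []
        for c in self.commits:
            if as_of is not None and c.timestamp > as_of:
                break  # commits are ordered, so later ones are skipped
            rows.extend(c.changes.get(table, []))
        return rows


if __name__ == "__main__":
    log = CommitLog()
    # One entry touches two tables: readers see both changes or neither.
    log.commit(1, {"users": [{"id": 1}], "orders": [{"id": 100}]})
    log.commit(2, {"users": [{"id": 2}]})
    print(log.snapshot("users", as_of=1))
    print(log.snapshot("users"))
```

Because each log entry is appended as a unit, a reader at any point in time sees either all of a multi-table commit or none of it.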

Maintenance & Community

No specific community links (e.g., Discord, Slack) or details on maintenance frequency were found in the provided README.

Licensing & Compatibility

The README does not explicitly state the project's license. This is a critical omission for assessing commercial compatibility and usage restrictions.

Limitations & Caveats

Running on a local laptop is recommended only for testing and experimentation because system clocks may drift; production deployments require a filesystem with strong read-after-write consistency. The Sync component, which synchronizes DeltaCAT with other table formats, is still in development.
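
To see why clock drift matters, consider any system that orders events by wall-clock timestamps. The sketch below assumes such timestamp-based ordering purely for illustration (the summary only states that drift is a concern, not how DeltaCAT orders its commits); the helpers `record` and `order_by_timestamp` are hypothetical.

```python
from typing import List, Tuple


def record(name: str, true_time: float, clock_offset: float) -> Tuple[str, float]:
    """Stamp an event with the writer's (possibly drifted) clock reading."""
    return (name, true_time + clock_offset)


def order_by_timestamp(events: List[Tuple[str, float]]) -> List[str]:
    """Return event names sorted by their recorded timestamps."""
    return [name for name, _ in sorted(events, key=lambda e: e[1])]


# Writer A's clock is accurate; writer B's clock runs 5 seconds behind.
events = [
    record("A: create table", true_time=100.0, clock_offset=0.0),
    record("B: append rows", true_time=102.0, clock_offset=-5.0),
]

# B really happened after A, but its drifted stamp (97.0) sorts first.
print(order_by_timestamp(events))
```

In this example the append appears to precede the table creation, which is the kind of misordering that synchronized clocks in a production environment are meant to prevent.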

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 2
  • Star History: 10 stars in the last 30 days
