uber
Data access library for deep learning model training/evaluation from Apache Parquet datasets
Petastorm is a data access library designed for efficient deep learning model training and evaluation. It enables direct data consumption from Apache Parquet datasets, supporting popular frameworks like TensorFlow, PyTorch, and PySpark, as well as pure Python.
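As a rough illustration of consuming a dataset from these frameworks, the sketch below wraps a reader in Petastorm's PyTorch `DataLoader` and in a `tf.data.Dataset` built with `make_petastorm_dataset`. The helper names `torch_batches` and `tf_dataset`, the default batch size, and any dataset URL passed in are assumptions made for illustration; imports are deferred so the helpers can be defined without the frameworks installed.

```python
def torch_batches(dataset_url, batch_size=32):
    """Yield batches from a Petastorm dataset through the PyTorch adapter."""
    # Lazy imports: Petastorm and PyTorch are only needed once iteration starts.
    from petastorm import make_reader
    from petastorm.pytorch import DataLoader

    with DataLoader(make_reader(dataset_url), batch_size=batch_size) as loader:
        for batch in loader:  # each batch is keyed by field name
            yield batch


def tf_dataset(dataset_url):
    """Return (reader, dataset); the caller stops and joins the reader when done."""
    from petastorm import make_reader
    from petastorm.tf_utils import make_petastorm_dataset

    reader = make_reader(dataset_url)
    return reader, make_petastorm_dataset(reader)
```

Both helpers expect a URL such as `file:///...` or `hdfs://...` that points at a dataset written with Petastorm metadata.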
How It Works
Petastorm uses Apache Parquet as its storage format and augments it with higher-level schema information, so that multidimensional arrays can be treated as native data types. Fields are stored through extensible data codecs, which cover image compression (e.g., JPEG, PNG) as well as custom implementations. Datasets are typically generated with PySpark, which handles Parquet natively and scales from a single machine to a cluster. For data access, the library provides a Reader class with selective column readout, parallelism control, row filtering, shuffling, and local caching.
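The reader features listed above map onto keyword arguments of `make_reader`. The following is a minimal sketch, not a definitive recipe: the helper names `reader_options` and `read_even_ids`, the field names `id` and `image1`, and the worker count are hypothetical, and local caching is left out for brevity.

```python
def reader_options(columns, workers=4, shuffle=True):
    """Collect keyword arguments for petastorm.make_reader."""
    return {
        "schema_fields": list(columns),   # selective column readout
        "reader_pool_type": "thread",     # worker pool implementation
        "workers_count": workers,         # parallelism control
        "shuffle_row_groups": shuffle,    # shuffling at row-group granularity
        "num_epochs": 1,                  # a single pass over the data
    }


def read_even_ids(dataset_url, columns=("id", "image1")):
    """Yield rows whose hypothetical integer 'id' field is even."""
    # Imported lazily so reader_options() works without Petastorm installed.
    from petastorm import make_reader
    from petastorm.predicates import in_lambda

    # Row filtering: the predicate receives the value of each listed field.
    keep_even = in_lambda(["id"], lambda id_value: id_value % 2 == 0)
    with make_reader(dataset_url, predicate=keep_even,
                     **reader_options(columns)) as reader:
        for row in reader:  # each row is a namedtuple of the selected fields
            yield row
```

Keeping the options in a plain dictionary makes it easy to reuse the same settings across training and evaluation readers.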
Quick Start & Requirements
Install: pip install petastorm
Optional extras: petastorm[tf], petastorm[tf_gpu], petastorm[torch], petastorm[opencv].
Highlighted Details
Adapters for TensorFlow (tf.data.Dataset) and PyTorch (DataLoader).
Reads native Parquet stores through make_batch_reader.
BatchedDataLoader and InMemBatchedDataLoader for PyTorch with improved throughput and memory caching.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
make_batch_reader has limited support for native Parquet column types.
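For context, a minimal sketch of reading a plain Parquet store with `make_batch_reader` might look like the following. The helper name `column_batches` and any URL passed to it are hypothetical; column types outside the supported set may fail or need conversion before they can be read this way.

```python
def column_batches(parquet_url, columns=None):
    """Yield one dict per batch, mapping column names to arrays of values."""
    # Lazy import so the helper can be defined without Petastorm installed.
    from petastorm import make_batch_reader

    # schema_fields=None reads every column; pass a list of names to select a subset.
    with make_batch_reader(parquet_url, schema_fields=columns) as reader:
        for batch in reader:      # each batch is a namedtuple of column arrays
            yield batch._asdict()
```

Unlike `make_reader`, which yields one row at a time, this reader returns whole batches of rows per iteration.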