Data access library for deep learning model training/evaluation from Apache Parquet datasets
Petastorm is a data access library designed for efficient deep learning model training and evaluation. It enables direct data consumption from Apache Parquet datasets, supporting popular frameworks like TensorFlow, PyTorch, and PySpark, as well as pure Python.
How It Works
Petastorm uses Apache Parquet as its storage format and augments it with higher-level schema information so that multidimensional arrays can be treated as native data types. Data codecs are extensible: the library ships codecs for image compression (e.g., JPEG, PNG) and accepts custom implementations. Datasets are typically generated with PySpark, which handles Parquet natively and scales from a single machine to a cluster. The library provides a Reader class for data access, offering selective column readout, parallelism control, row filtering, shuffling, and local caching.
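The schema, codec, and Reader features described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the schema name, the field names (id, image), the cache sizes, and the helper functions build_example_schema, reader_options, and iter_even_ids are all hypothetical. The petastorm and pyspark imports are done lazily inside the functions, so the file loads even where those packages are not installed.

```python
def build_example_schema():
    """Define a hypothetical two-field schema with a codec per field."""
    # Lazy imports: petastorm and pyspark are only needed when this is called.
    import numpy as np
    from petastorm.codecs import CompressedImageCodec, ScalarCodec
    from petastorm.unischema import Unischema, UnischemaField
    from pyspark.sql.types import IntegerType

    return Unischema("ExampleSchema", [
        # A scalar stored as a native Parquet integer.
        UnischemaField("id", np.int32, (), ScalarCodec(IntegerType()), False),
        # A 3-D array stored as PNG-compressed bytes.
        UnischemaField("image", np.uint8, (128, 256, 3),
                       CompressedImageCodec("png"), False),
    ])


def reader_options(columns, workers=4, shuffle=True, cache_dir=None):
    """Collect make_reader keyword arguments for the features listed above."""
    opts = {
        "schema_fields": list(columns),   # selective column readout (regex patterns)
        "reader_pool_type": "thread",     # parallelism control: worker pool type
        "workers_count": workers,         # parallelism control: number of workers
        "shuffle_row_groups": shuffle,    # shuffling at row-group granularity
        "num_epochs": 1,
    }
    if cache_dir is not None:
        # Local caching; the size values are placeholder assumptions.
        opts.update(
            cache_type="local-disk",
            cache_location=cache_dir,
            cache_size_limit=10 * 2**30,      # total cache size in bytes
            cache_row_size_estimate=2**20,    # rough size of one row in bytes
        )
    return opts


def iter_even_ids(dataset_url, **opts):
    """Yield rows whose (assumed) 'id' field is even, using a row filter."""
    from petastorm import make_reader
    from petastorm.predicates import in_lambda

    keep_even = in_lambda(["id"], lambda id_: id_ % 2 == 0)
    with make_reader(dataset_url, predicate=keep_even, **opts) as reader:
        for row in reader:
            yield row
```

With a dataset at a placeholder location such as file:///tmp/example_dataset, rows could then be consumed with iter_even_ids("file:///tmp/example_dataset", **reader_options(["^id$", "^image$"])). Keeping the options in a plain dictionary makes it easy to inspect or log the reader configuration before any data is touched.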
Quick Start & Requirements
pip install petastorm
Optional extras: petastorm[tf], petastorm[tf_gpu], petastorm[torch], petastorm[opencv].
Highlighted Details
Integrates with TensorFlow (tf.data.Dataset) and PyTorch (DataLoader).
Reads Parquet stores that were not created with Petastorm via make_batch_reader.
Provides BatchedDataLoader and InMemBatchedDataLoader for PyTorch with improved throughput and memory caching.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
make_batch_reader has limited support for native Parquet column types.
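For context on how make_batch_reader is typically used, here is a small sketch that reads one numeric column from a plain Parquet store and averages it batch by batch. The helper names mean_of_batches and column_mean are hypothetical, and the column is assumed to hold a simple scalar type; given the caveat above, columns with more complex native Parquet types may not load.

```python
def mean_of_batches(batches):
    """Average all values across an iterable of numeric batches."""
    total, count = 0.0, 0
    for values in batches:
        total += float(sum(values))
        count += len(values)
    return total / count if count else float("nan")


def column_mean(parquet_url, column):
    """Average one scalar column of a Parquet store read in batches."""
    # Lazy import: petastorm is only needed when this function is called.
    from petastorm import make_batch_reader

    # schema_fields takes regex patterns; anchor the name to match it exactly.
    pattern = "^{}$".format(column)
    with make_batch_reader(parquet_url, schema_fields=[pattern]) as reader:
        # Each item is a batch of rows; every field holds an array of values.
        return mean_of_batches(getattr(batch, column) for batch in reader)
```

Unlike make_reader, which yields one row at a time, make_batch_reader yields batches of rows, so per-column aggregation like this works on whole arrays rather than single values.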