petastorm by uber

Data access library for deep learning model training/evaluation from Apache Parquet datasets

Created 7 years ago

1,871 stars

Top 23.0% on SourcePulse

10 Experts Love This Project

Cofounder of Lightning AI

Wes McKinney

Author of Pandas

and 6 more!

Project Summary

Petastorm is a data access library designed for efficient deep learning model training and evaluation. It enables direct data consumption from Apache Parquet datasets, supporting popular frameworks like TensorFlow, PyTorch, and PySpark, as well as pure Python.

How It Works

Petastorm uses Apache Parquet as its storage format and augments it with higher-level schema information so that multidimensional arrays can be treated as native data types. Its data codecs are extensible: fields can be stored with a standard compression such as JPEG or PNG, or with a user-supplied codec. Datasets are typically generated with PySpark, which handles Parquet natively and scales from a single machine to a cluster. Data is read back through a Reader class that supports selective column readout, parallelism control, row filtering, shuffling, and local caching.
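A minimal sketch of the schema and reader ideas described above, assuming petastorm, pyspark, and numpy are installed. The schema name, field names, shapes, and the helper functions to_dataset_url, build_example_schema, and read_rows are hypothetical and chosen only for illustration. Petastorm imports are done lazily inside the functions so the module loads even where the optional dependencies are missing.

```python
from pathlib import Path


def to_dataset_url(local_dir):
    """Petastorm addresses datasets by URL, so a local path needs a file:// prefix."""
    return Path(local_dir).absolute().as_uri()


def build_example_schema():
    """Return a hypothetical Unischema with a scalar, an image, and an array field."""
    # Lazy imports: only needed when the schema is actually built.
    import numpy as np
    from pyspark.sql.types import IntegerType
    from petastorm.codecs import CompressedImageCodec, NdarrayCodec, ScalarCodec
    from petastorm.unischema import Unischema, UnischemaField

    return Unischema('ExampleSchema', [
        # Plain scalar stored as a native Parquet integer.
        UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
        # Fixed-shape image stored PNG-compressed.
        UnischemaField('image', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),
        # Array with a variable first dimension (None), serialized as an ndarray.
        UnischemaField('features', np.float32, (None, 16), NdarrayCodec(), False),
    ])


def read_rows(dataset_url, field_names, workers=4, predicate=None):
    """Yield decoded rows, reading only the requested fields."""
    from petastorm import make_reader

    with make_reader(dataset_url,
                     schema_fields=field_names,   # selective column readout
                     workers_count=workers,       # parallelism control
                     shuffle_row_groups=True,     # shuffling at row-group level
                     predicate=predicate,         # optional row filtering
                     num_epochs=1) as reader:
        for row in reader:
            yield row
```

A call such as read_rows(to_dataset_url('/data/example'), ['id', 'image']) would iterate over rows exposing only those two fields; the path here is a placeholder, not a dataset that ships with the library.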

Quick Start & Requirements

Install via pip: pip install petastorm
Optional extras for specific frameworks and libraries: petastorm[tf], petastorm[tf_gpu], petastorm[torch], petastorm[opencv].
Dataset generation requires PySpark.
Official documentation: https://petastorm.readthedocs.io/
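Since dataset generation goes through PySpark, a rough sketch of that step may help. It assumes an existing SparkSession and a Unischema object are passed in; the function name write_dataset, its parameters, and the default row-group size are illustrative assumptions, not values recommended by the project.

```python
def write_dataset(spark, output_url, schema, row_dicts, rowgroup_size_mb=256):
    """Write row_dicts (dicts keyed by schema field name) as a Petastorm dataset.

    spark       -- an existing pyspark.sql.SparkSession (assumed to be provided)
    output_url  -- destination URL, e.g. a file:// or hdfs:// location
    schema      -- a petastorm Unischema describing the fields
    row_dicts   -- a list of dicts holding numpy values that match the schema
    """
    # Lazy imports so the module can be loaded without petastorm installed.
    from petastorm.etl.dataset_metadata import materialize_dataset
    from petastorm.unischema import dict_to_spark_row

    # materialize_dataset configures the Parquet row-group size and, on exit,
    # stores the Petastorm schema metadata alongside the Parquet files.
    with materialize_dataset(spark, output_url, schema, rowgroup_size_mb):
        rows_rdd = (spark.sparkContext
                    .parallelize(row_dicts)
                    # Encode each dict with the codecs declared in the schema.
                    .map(lambda d: dict_to_spark_row(schema, d)))
        (spark.createDataFrame(rows_rdd, schema.as_spark_schema())
              .write
              .mode('overwrite')
              .parquet(output_url))
```

For large datasets the rows would normally be produced inside Spark rather than passed in as a Python list; the list is used here only to keep the sketch short.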

Highlighted Details

Seamless integration with TensorFlow (tf.data.Dataset) and PyTorch (DataLoader).
Supports reading directly from existing Parquet stores, including ones not generated by Petastorm, using make_batch_reader.
Enables analysis and manipulation of datasets using PySpark and SQL.
Offers BatchedDataLoader and InMemBatchedDataLoader for PyTorch: the former aims at higher throughput by working on whole batches, and the latter caches the dataset in memory for reuse across epochs.
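The framework integrations listed above can be sketched as follows, assuming the torch and tf extras are installed. The function names iterate_torch_batches and as_tf_dataset and the batch size are hypothetical; the sketch uses the basic petastorm.pytorch.DataLoader rather than the batched variants.

```python
def iterate_torch_batches(parquet_url, batch_size=64):
    """Yield PyTorch batches from a plain Parquet store."""
    # Lazy imports so the module loads without petastorm or torch installed.
    from petastorm import make_batch_reader
    from petastorm.pytorch import DataLoader

    # make_batch_reader reads native Parquet columns in row-group sized chunks;
    # the DataLoader regroups them into batches of batch_size.
    with DataLoader(make_batch_reader(parquet_url, num_epochs=1),
                    batch_size=batch_size) as loader:
        for batch in loader:
            yield batch


def as_tf_dataset(reader):
    """Wrap an open Petastorm reader as a tf.data.Dataset."""
    from petastorm.tf_utils import make_petastorm_dataset

    return make_petastorm_dataset(reader)
```

The caller stays responsible for closing any reader passed to as_tf_dataset, for example by opening it in a with block around the training loop.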

Maintenance & Community

Active development and contributions from Uber ATG.
Issue tracking and contribution guidelines available on GitHub: https://github.com/uber/petastorm/issues

Licensing & Compatibility

Licensed under the Apache 2.0 License.
Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

Dataset generation is primarily done via PySpark, which may be a barrier for users unfamiliar with Spark.
make_batch_reader has limited support for native Parquet column types.

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days