petastorm by uber

Data access library for deep learning model training/evaluation from Apache Parquet datasets

created 7 years ago
1,848 stars

Top 23.9% on sourcepulse

Project Summary

Petastorm is a data access library designed for efficient deep learning model training and evaluation. It enables direct data consumption from Apache Parquet datasets, supporting popular frameworks like TensorFlow, PyTorch, and PySpark, as well as pure Python.

How It Works

Petastorm uses Apache Parquet as its storage format and augments it with higher-level schema information so that multidimensional arrays can be treated as native data types. Data codecs are extensible: a field can be compressed with a standard codec (e.g., JPEG, PNG) or with a custom implementation. Datasets are typically generated with PySpark, which handles Parquet natively and scales from a single machine to a cluster. For data access, the library provides a Reader class with selective column readout, parallelism control, row filtering, shuffling, and local caching.
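
The Reader features listed above map onto keyword arguments of petastorm's make_reader factory. The following is a minimal sketch, not the project's recommended configuration: the helper names reader_options and read_rows, the default values, and the cache sizes are assumptions chosen for illustration. Row filtering (the predicate argument) is left out for brevity. Petastorm is imported lazily, so the option builder can be used without the library installed.

```python
def reader_options(columns=None, workers=4, shuffle=True, cache_dir=None):
    """Build keyword arguments for petastorm.make_reader (no I/O happens here).

    Hypothetical helper: the defaults below are illustrative assumptions.
    """
    options = {
        "schema_fields": columns,        # selective column readout (None = all fields)
        "reader_pool_type": "thread",    # 'thread', 'process', or 'dummy'
        "workers_count": workers,        # parallelism control
        "shuffle_row_groups": shuffle,   # shuffling at row-group granularity
    }
    if cache_dir is not None:
        # Local caching; the size values are assumed, tune them for real data.
        options.update(
            cache_type="local-disk",
            cache_location=cache_dir,
            cache_size_limit=10 * 2**30,       # bytes
            cache_row_size_estimate=2**20,     # bytes per row, rough guess
        )
    return options


def read_rows(dataset_url, limit=5, **options):
    """Return up to `limit` rows from a Petastorm dataset."""
    from petastorm import make_reader  # lazy import: requires `pip install petastorm`

    rows = []
    with make_reader(dataset_url, **options) as reader:
        for row in reader:
            rows.append(row)
            if len(rows) >= limit:
                break
    return rows
```

A call might look like read_rows("file:///tmp/hello_world_dataset", **reader_options(columns=["id"])), where the dataset path and the field name are placeholders.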

Quick Start & Requirements

  • Install via pip: pip install petastorm
  • Optional extras for specific frameworks and libraries: petastorm[tf], petastorm[tf_gpu], petastorm[torch], petastorm[opencv].
  • Dataset generation requires PySpark.
  • Official documentation: https://petastorm.readthedocs.io/
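
After installation, it can be useful to check which optional integrations are actually importable in the current environment. The sketch below uses only the Python standard library. The mapping from labels to module names is an assumption based on the usual import names of those packages (tensorflow, torch, cv2, pyspark), and the helper name is hypothetical.

```python
from importlib.util import find_spec

# Assumed import names for the optional dependencies mentioned above.
OPTIONAL_MODULES = {
    "tf": "tensorflow",     # petastorm[tf] or petastorm[tf_gpu]
    "torch": "torch",       # petastorm[torch]
    "opencv": "cv2",        # petastorm[opencv]
    "pyspark": "pyspark",   # needed for dataset generation
}


def optional_dependency_status():
    """Map each label to True if its module can be found, without importing it."""
    return {
        label: find_spec(module) is not None
        for label, module in OPTIONAL_MODULES.items()
    }


if __name__ == "__main__":
    for label, available in optional_dependency_status().items():
        print(f"{label}: {'available' if available else 'missing'}")
```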

Highlighted Details

  • Seamless integration with TensorFlow (tf.data.Dataset) and PyTorch (DataLoader).
  • Reads existing Apache Parquet stores that were not created with Petastorm via make_batch_reader.
  • Enables analysis and manipulation of datasets using PySpark and SQL.
  • Offers BatchedDataLoader and InMemBatchedDataLoader for PyTorch with improved throughput and memory caching.
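
The TensorFlow and PyTorch integrations above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive recipe: the function names and the default batch size are hypothetical, the TensorFlow part assumes TF 2.x eager execution, and the PyTorch part assumes every column in the store can be converted to tensors. Framework imports are deferred to call time, so each function only needs its own framework installed.

```python
def first_tf_element(dataset_url):
    """Return the first element of a tf.data.Dataset backed by a Parquet store."""
    from petastorm import make_batch_reader
    from petastorm.tf_utils import make_petastorm_dataset

    with make_batch_reader(dataset_url) as reader:
        dataset = make_petastorm_dataset(reader)  # a tf.data.Dataset
        for element in dataset.take(1):           # eager iteration (TF 2.x assumed)
            return element
    return None  # the store was empty


def count_torch_batches(dataset_url, batch_size=64):
    """Iterate a Parquet store through petastorm's PyTorch DataLoader and count batches."""
    from petastorm import make_batch_reader
    from petastorm.pytorch import DataLoader

    # The DataLoader is a context manager and closes the reader on exit.
    with DataLoader(make_batch_reader(dataset_url), batch_size=batch_size) as loader:
        return sum(1 for _ in loader)
```

Both functions take a dataset URL such as "file:///tmp/some_parquet_store", which is a placeholder path.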

Maintenance & Community

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

  • Dataset generation is primarily done via PySpark, which may be a barrier for users not familiar with the framework.
  • make_batch_reader does not yet support all native Parquet column types.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 0
  • Star history: 16 stars in the last 90 days
