webdataset by webdataset

High-performance I/O system for large deep learning problems, with strong PyTorch support

created 6 years ago
2,734 stars

Top 17.8% on sourcepulse

Project Summary

WebDataset is a high-performance Python I/O system for large-scale deep learning datasets. It takes a streaming approach to data loading, optimized for sequential reads from local disks and cloud object stores. It is aimed at researchers and engineers who work with massive datasets and need efficient data pipelines.

How It Works

WebDataset stores data in tar archives, where the files belonging to one sample (e.g., an image and its label) share a common base name and differ only in extension. Because a tar archive is read front to back, this format favors high sequential I/O throughput, especially from cloud storage. It builds on PyTorch's IterableDataset for stream processing, so pipelines can be assembled from stages such as shuffling, decoding (PIL, torchvision, etc.), and transformations.
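
The grouping convention described above can be sketched with the Python standard library alone. This is a minimal illustration, not the library's implementation: the helper names `write_shard` and `iter_samples`, the `__key__` field, and the sample contents are assumptions made for the example, and the key is split at the first dot of a flat file name for simplicity.

```python
import io
import os
import tarfile
import tempfile


def write_shard(path, samples):
    """Write samples to a tar file; each sample maps "__key__" and extensions to bytes."""
    with tarfile.open(path, "w") as tar:
        for sample in samples:
            key = sample["__key__"]
            for ext, payload in sample.items():
                if ext == "__key__":
                    continue
                # Files of one sample share the base name and differ in extension.
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def iter_samples(path):
    """Read the tar sequentially, grouping consecutive members with the same base name."""
    current = None
    with tarfile.open(path, "r|") as tar:  # "r|" reads as a non-seekable stream
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            data = tar.extractfile(member).read()
            if current is None or current["__key__"] != key:
                if current is not None:
                    yield current
                current = {"__key__": key}
            current[ext] = data
    if current is not None:
        yield current


# Hypothetical two-sample shard with placeholder payloads.
shard_path = os.path.join(tempfile.mkdtemp(), "shard-000000.tar")
write_shard(shard_path, [
    {"__key__": "sample0000", "jpg": b"fake-image-0", "cls": b"3"},
    {"__key__": "sample0001", "jpg": b"fake-image-1", "cls": b"7"},
])
loaded = list(iter_samples(shard_path))
```

Because samples are reassembled from adjacent tar members, the reader never needs to seek, which is what makes the layout suitable for streaming.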

Quick Start & Requirements

  • Install: pip install webdataset
  • Dependencies: PyTorch, NumPy, braceexpand. Optional dependencies (PIL/Pillow, torchvision, etc.) are loaded dynamically based on decoding needs. Command-line tools (curl, gsutil, awscli, azcli) are required for remote and cloud storage access.
  • Documentation: WebDataset Format Specification, Notebooks
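
The reliance on command-line tools suggests that remote shards are consumed as a byte stream from a subprocess. The sketch below illustrates that general idea under stated assumptions: it is not the library's code, `open_tar_from_command` is a hypothetical helper, and a Python one-liner stands in for a real download command such as curl so the example runs without network access.

```python
import io
import os
import subprocess
import sys
import tarfile
import tempfile


def open_tar_from_command(argv):
    """Start a command and open its standard output as a streaming tar archive."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    return proc, tarfile.open(fileobj=proc.stdout, mode="r|")


# Build a tiny local tar file that stands in for a remote shard.
shard_path = os.path.join(tempfile.mkdtemp(), "shard.tar")
with tarfile.open(shard_path, "w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="sample0000.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Stand-in for a download tool: copy the file's bytes to standard output.
copy_cmd = [
    sys.executable, "-c",
    "import shutil, sys; shutil.copyfileobj(open(sys.argv[1], 'rb'), sys.stdout.buffer)",
    shard_path,
]

proc, tar = open_tar_from_command(copy_cmd)
# Members must be read in order because a pipe cannot seek backwards.
members = {m.name: tar.extractfile(m).read() for m in tar}
tar.close()
proc.stdout.close()
proc.wait()
```

Swapping `copy_cmd` for a real download command would leave the reading side unchanged, since the tar reader only sees a sequential byte stream.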

Highlighted Details

  • Supports local disk, pipes, and cloud object storage (GCS, S3, Azure).
  • Compatible with PyTorch, TensorFlow, and JAX.
  • Offers both a concise "fluid" API and an explicit DataPipeline API.
  • Includes the wids library for indexed/random access, useful for legacy code or specific sampling needs.

Maintenance & Community

The project is undergoing a refactoring into webdataset, wids, and wsds libraries, with a planned switchover in March 2025.

Licensing & Compatibility

The library is released under a permissive license, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

With the IterableDataset approach, it is difficult to give every node exactly the same number of samples per fixed-length epoch; the usual workaround is shard resampling. The wids library offers indexed access, but it requires a metadata file and local storage for shards, and its API is not fully consistent with the main webdataset library.
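
A rough sketch of why shard resampling helps with balance: if each node draws shards with replacement from an endless stream and stops after a fixed number of samples, every node sees the same sample count even when shard sizes differ. The code uses only the standard library; `SHARD_SIZES`, `read_shard`, `resampled_shards`, and `node_epoch` are hypothetical names for illustration and not part of the webdataset API.

```python
import itertools
import random

# Hypothetical shards with uneven sample counts.
SHARD_SIZES = {"shard-000000.tar": 3, "shard-000001.tar": 5, "shard-000002.tar": 4}


def read_shard(url):
    """Stand-in for streaming a shard: yield one placeholder per sample."""
    for i in range(SHARD_SIZES[url]):
        yield f"{url}#{i}"


def resampled_shards(shard_urls, seed):
    """Yield shard URLs forever, sampled uniformly with replacement."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(shard_urls)


def node_epoch(node_rank, samples_per_epoch):
    """Return exactly samples_per_epoch samples for one node."""
    # A different seed per node gives each node its own shard sequence.
    shards = resampled_shards(sorted(SHARD_SIZES), seed=1000 + node_rank)
    stream = itertools.chain.from_iterable(read_shard(u) for u in shards)
    return list(itertools.islice(stream, samples_per_epoch))


epochs = [node_epoch(rank, samples_per_epoch=10) for rank in range(4)]
```

The trade-off is that an epoch no longer visits each sample exactly once: some samples may repeat and others may be skipped within a given epoch.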

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 173 stars in the last 90 days
