High-performance I/O system for large deep learning problems, strong PyTorch support
WebDataset is a high-performance Python I/O system designed for large-scale deep learning datasets. It offers a streaming approach to data loading, optimized for sequential reads from local disks and cloud object stores, benefiting researchers and engineers working with massive datasets who need efficient data pipelines.
How It Works
WebDataset stores data in tar archives, where related files (e.g., image and label) share a common base name. This sequential format maximizes I/O throughput, especially from cloud storage. It leverages PyTorch's IterableDataset for stream processing, allowing for flexible pipeline construction with features like shuffling, decoding (PIL, torchvision, etc.), and transformations.
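The storage convention described above can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: the helper names write_shard and iter_samples, the sample keys, and the "__key__" field are assumptions made for illustration. It packs files that share a base name into a tar archive, then reads the archive strictly front to back and regroups the files into one dictionary per sample.

```python
import io
import tarfile


def write_shard(samples):
    """Pack samples into an in-memory tar; files of one sample share a base name.

    `samples` is a list of (key, {extension: bytes}) pairs (hypothetical layout).
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_samples(shard_bytes):
    """Read the tar sequentially and yield one dict per base name.

    Mode "r|" forbids seeking, which mimics a streaming read from disk or network.
    Files belonging to one sample must be adjacent in the archive.
    """
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r|") as tar:
        current_key, sample = None, {}
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: "sample0001.jpg" -> ("sample0001", "jpg").
            key, ext = member.name.split(".", 1)
            if key != current_key and sample:
                yield sample  # base name changed, so the previous sample is complete
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
        if sample:
            yield sample


# Usage: two samples, each with an image payload and a class label.
shard = write_shard([
    ("sample0001", {"jpg": b"img-a", "cls": b"3"}),
    ("sample0002", {"jpg": b"img-b", "cls": b"7"}),
])
for sample in iter_samples(shard):
    print(sample["__key__"], sorted(k for k in sample if k != "__key__"))
```

Because the reader never seeks backward, the same loop works over any byte stream, which is why the layout suits sequential reads from object stores.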
Quick Start & Requirements
Install with pip install webdataset.
Requires braceexpand. Optional dependencies (PIL/Pillow, torchvision, etc.) are loaded dynamically based on decoding needs. Cloud CLI tools (curl, gsutil, awscli, azcli) are required for cloud storage access.
Highlighted Details
DataPipeline API.
wids library for indexed/random access, useful for legacy code or specific sampling needs.
Maintenance & Community
The project is being refactored into the webdataset, wids, and wsds libraries, with a planned switchover in March 2025.
Licensing & Compatibility
The library is released under a permissive license, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
The IterableDataset approach can make it tricky to achieve perfectly balanced sample counts across nodes for fixed epochs, which often requires shard resampling. The wids library, while offering indexed access, requires a metadata file and local storage for shards, and its API is not fully consistent with the main webdataset library.
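The idea behind shard resampling can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the library's code: resampled_shards, fixed_epoch, and the read_shard callback are hypothetical names. Each node draws shards with replacement from the full list and stops after a fixed number of samples, so every node produces the same count per epoch even when shard sizes differ.

```python
import itertools
import random


def resampled_shards(shard_urls, seed):
    """Yield shard URLs forever, drawn uniformly with replacement."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(shard_urls)


def fixed_epoch(shard_urls, read_shard, samples_per_epoch, seed=0):
    """Return an iterator over exactly `samples_per_epoch` samples.

    `read_shard(url)` is a hypothetical callback returning an iterable of
    samples for one shard. At least one shard must be non-empty, otherwise
    the stream never produces a sample.
    """
    stream = itertools.chain.from_iterable(
        read_shard(url) for url in resampled_shards(shard_urls, seed)
    )
    return itertools.islice(stream, samples_per_epoch)


# Usage: three shards of uneven size, two nodes with different seeds.
sizes = {"shard-a": 3, "shard-b": 5, "shard-c": 1}


def fake_read(url):
    # Stand-in for real shard reading; yields (shard, index) pairs.
    return [(url, i) for i in range(sizes[url])]


node0 = list(fixed_epoch(list(sizes), fake_read, 10, seed=0))
node1 = list(fixed_epoch(list(sizes), fake_read, 10, seed=1))
print(len(node0), len(node1))
```

The trade-off is that an epoch no longer visits every sample exactly once: some samples may repeat and others may be skipped within a given epoch.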