webdataset by webdataset

High-performance I/O system for large deep learning problems, with strong PyTorch support

Created 6 years ago

2,952 stars

Top 16.1% on SourcePulse

18 Experts Love This Project

Luca Soldaini

Research Scientist at Ai2

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Jeff Hammerbacher

Cofounder of Cloudera

Théophile Gervet

Cofounder of Genesis AI

and 14 more!

Project Summary

WebDataset is a high-performance Python I/O system for large-scale deep learning datasets. It streams data instead of loading it by index, and is optimized for sequential reads from local disks and cloud object stores. It is aimed at researchers and engineers who need efficient data pipelines over massive datasets.

How It Works

WebDataset stores data in tar archives, where related files (e.g., image and label) share a common base name. This sequential format maximizes I/O throughput, especially from cloud storage. It leverages PyTorch's IterableDataset for stream processing, allowing for flexible pipeline construction with features like shuffling, decoding (PIL, torchvision, etc.), and transformations.
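
The shared-base-name convention can be illustrated with the Python standard library alone. The sketch below is not WebDataset's own reader; it is a minimal stand-in that assumes every member name contains a dot and that files of one sample are stored consecutively. The helper names `write_shard` and `read_shard` are hypothetical.

```python
import io
import tarfile
from itertools import groupby


def write_shard(samples):
    """Pack samples into an in-memory tar; files of one sample share a base name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(data):
    """Read the tar strictly front to back, grouping consecutive members by base name."""
    # mode "r|" opens the archive as a non-seekable stream (sequential reads only)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|") as tar:
        members = ((m.name, tar.extractfile(m).read()) for m in tar if m.isfile())
        for key, group in groupby(members, key=lambda item: item[0].split(".", 1)[0]):
            sample = {"__key__": key}
            for name, body in group:
                sample[name.split(".", 1)[1]] = body  # extension becomes the field name
            yield sample


shard = write_shard([
    ("sample0000", {"jpg": b"<image bytes>", "cls": b"3"}),
    ("sample0001", {"jpg": b"<image bytes>", "cls": b"7"}),
])
for sample in read_shard(shard):
    print(sample["__key__"], sorted(k for k in sample if k != "__key__"))
```

Because the reader never seeks backwards, the same loop would work over a pipe or an HTTP response body, which is the property that makes the format suit object stores.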

Quick Start & Requirements

Install: pip install webdataset
Dependencies: PyTorch, NumPy, braceexpand. Optional dependencies (PIL/Pillow, torchvision, etc.) are loaded dynamically based on decoding needs. A command-line client for the relevant backend (curl, gsutil, awscli, or azcli) is needed only when reading from that cloud store.
Documentation: WebDataset Format Specification, Notebooks

Highlighted Details

Supports local disk, pipes, and cloud object storage (GCS, S3, Azure).
Compatible with PyTorch, TensorFlow, and JAX.
Offers both a concise "fluid" API and an explicit DataPipeline API.
Includes the wids library for indexed (random) access, useful for legacy code or specific sampling needs.
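
The two API styles listed above might look roughly like this. It is a sketch rather than a reference: it assumes `webdataset` is installed, that each sample has `jpg` and `cls` files, and that the shard pattern in `SHARD_URLS` (a hypothetical path using brace notation) exists. The imports sit inside the functions so the module loads even where the library is absent.

```python
# Hypothetical shard pattern; braces expand to train-0000.tar ... train-0009.tar.
SHARD_URLS = "data/train-{0000..0009}.tar"


def fluid_pipeline(urls=SHARD_URLS):
    """Concise chained style: each call appends one processing stage."""
    import webdataset as wds  # requires `pip install webdataset`

    return (
        wds.WebDataset(urls, shardshuffle=False)
        .shuffle(1000)           # in-memory shuffle buffer of 1000 samples
        .decode("pil")           # decode image fields with Pillow
        .to_tuple("jpg", "cls")  # select fields by file extension
    )


def explicit_pipeline(urls=SHARD_URLS):
    """Explicit style: the same stages, listed one by one."""
    import webdataset as wds

    return wds.DataPipeline(
        wds.SimpleShardList(urls),  # yields shard URLs
        wds.split_by_worker,        # give each DataLoader worker its own shards
        wds.tarfile_to_samples(),   # open tars and group files into samples
        wds.shuffle(1000),
        wds.decode("pil"),
        wds.to_tuple("jpg", "cls"),
    )
```

Either object is an iterable dataset, so it can be passed to a PyTorch `DataLoader` or iterated directly. For cloud storage, a `pipe:` URL such as `pipe:curl -s -L <url>` would take the place of the local path.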

Maintenance & Community

The project is being refactored into three libraries (webdataset, wids, and wsds), with a switchover planned for March 2025.

Licensing & Compatibility

The library is released under a permissive license, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

Because an IterableDataset streams shards rather than indexing individual samples, giving every node exactly the same number of samples per fixed-length epoch is difficult and often requires shard resampling. The wids library offers indexed access but requires a metadata file and local storage for shards, and its API is not fully consistent with the main webdataset library.
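
The balancing problem and the resampling workaround can be shown without the library. The sketch below is a pure-Python model under assumed shard sizes, not WebDataset's implementation. It first splits shards statically across two nodes, then draws shards with replacement and cuts each node's stream to a fixed epoch length. All names (`shard_sizes`, `split_by_node`, `resampled_stream`, `fixed_epoch`) are hypothetical.

```python
import itertools
import random

# Hypothetical shards with uneven sample counts.
shard_sizes = {"s0.tar": 100, "s1.tar": 100, "s2.tar": 100, "s3.tar": 100, "s4.tar": 40}


def split_by_node(shards, rank, world_size):
    """Static split: node `rank` takes every world_size-th shard."""
    return shards[rank::world_size]


def resampled_stream(sizes, seed):
    """Endless sample stream: pick shards with replacement, emit their samples."""
    rng = random.Random(seed)
    names = sorted(sizes)
    while True:
        shard = rng.choice(names)
        for index in range(sizes[shard]):
            yield (shard, index)


def fixed_epoch(stream, epoch_len):
    """Cut an endless stream to exactly epoch_len samples."""
    return list(itertools.islice(stream, epoch_len))


shards = sorted(shard_sizes)
# Static split: node 0 gets s0, s2, s4 (240 samples); node 1 gets s1, s3 (200 samples).
static_counts = [
    sum(shard_sizes[s] for s in split_by_node(shards, rank, 2)) for rank in range(2)
]
# Resampling: every node yields the same number of samples per epoch.
resampled_counts = [
    len(fixed_epoch(resampled_stream(shard_sizes, seed=rank), 150)) for rank in range(2)
]
print(static_counts, resampled_counts)
```

The cost of this approach is that an epoch is no longer an exact pass over the data: some samples may repeat and others may be skipped. In the library itself, the corresponding options are, to the best of my knowledge, `resampled=True` on `WebDataset` together with `.with_epoch(n)`.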

Health Check

Last Commit: 6 months ago
Responsiveness: 1+ week

Star History

37 stars in the last 30 days