streaming by mosaicml

Data streaming library for efficient neural network training

created 3 years ago
1,355 stars

Top 30.3% on sourcepulse

Project Summary

This library provides high-performance, distributed data streaming for training neural networks on large datasets stored in cloud object storage. It targets researchers and engineers working with massive datasets, and it integrates with PyTorch and major cloud providers to improve training efficiency and scalability.

How It Works

The library converts raw data into a custom sharded format (MDS) optimized for efficient reading. It supports various compression codecs and cloud storage backends (AWS, OCI, GCS, Azure, etc.). A key innovation is its deterministic shuffling and sample ordering, which ensures reproducibility across different numbers of workers and enables instant mid-epoch resumption, significantly reducing downtime and egress costs.
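The idea of a sample order that does not depend on the number of workers can be illustrated with a small standard-library sketch. This is not the library's algorithm; the function names (epoch_order, worker_samples, merged_stream) and the position-modulo partitioning are assumptions made for illustration. The global order is derived only from a seed and an epoch number, each worker reads a fixed slice of global positions, and resuming mid-epoch amounts to skipping the positions already consumed.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Return the global sample order for one epoch.

    The order depends only on (seed, epoch), never on the worker count.
    """
    order = list(range(num_samples))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def worker_samples(order, num_workers, worker_id, resume_from=0):
    """Samples read by one worker: global positions pos >= resume_from
    where pos % num_workers == worker_id."""
    return [
        order[pos]
        for pos in range(resume_from, len(order))
        if pos % num_workers == worker_id
    ]


def merged_stream(order, num_workers, resume_from=0):
    """Interleave every worker's samples back into global position order."""
    iters = [
        iter(worker_samples(order, num_workers, w, resume_from))
        for w in range(num_workers)
    ]
    return [next(iters[pos % num_workers]) for pos in range(resume_from, len(order))]


order = epoch_order(num_samples=100, seed=42, epoch=0)

# The combined stream is identical for 2 and for 8 workers.
assert merged_stream(order, 2) == merged_stream(order, 8) == order

# Resuming after 37 samples continues exactly where the epoch left off.
assert merged_stream(order, 4, resume_from=37) == order[37:]
```

Because the order is a pure function of the seed and epoch, a restarted job only needs to know how many samples were already consumed, not which worker consumed them.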

Quick Start & Requirements

  • Install with pip install mosaicml-streaming.
  • Requires Python and PyTorch. Cloud storage credentials are needed for remote access.
  • Data conversion to MDS format is a prerequisite.
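To give a feel for what converting data into a sharded format involves, here is a minimal standard-library sketch. It does not reproduce the MDS format or the library's writer; build_shards, build_index, and locate are hypothetical helpers. Samples are packed into size-limited shards, and a small index records the first global sample id in each shard, so any sample can be mapped to its shard without reading the data.

```python
import bisect


def build_shards(samples, shard_size_limit):
    """Pack byte-string samples into shards of at most shard_size_limit bytes.

    A single sample larger than the limit still gets a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for sample in samples:
        if current and current_size + len(sample) > shard_size_limit:
            shards.append(current)
            current, current_size = [], 0
        current.append(sample)
        current_size += len(sample)
    if current:
        shards.append(current)
    return shards


def build_index(shards):
    """Return starts, where starts[i] is the global id of shard i's first sample."""
    starts, total = [], 0
    for shard in shards:
        starts.append(total)
        total += len(shard)
    return starts


def locate(starts, sample_id):
    """Map a global sample id to (shard_id, offset_within_shard)."""
    shard_id = bisect.bisect_right(starts, sample_id) - 1
    return shard_id, sample_id - starts[shard_id]


# Seven 10-byte samples with a 25-byte limit give shards of 2, 2, 2, and 1 samples.
samples = [bytes([i]) * 10 for i in range(7)]
shards = build_shards(samples, shard_size_limit=25)
starts = build_index(shards)

shard_id, offset = locate(starts, 5)
assert shards[shard_id][offset] == samples[5]
```

With an index like this, a reader can decide which shard to fetch for a given sample id before any shard has been downloaded.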

Highlighted Details

  • Supports seamless mixing of multiple datasets with configurable proportions.
  • Achieves true determinism for reproducible training runs across varying hardware configurations.
  • Enables instant mid-epoch resumption, saving significant compute time and costs.
  • Offers high throughput, comparable to or exceeding alternatives like WebDataset and PyTorch's ImageFolder.
  • Provides random access to samples, even before they are downloaded.
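Mixing datasets with configurable proportions can be sketched as follows. This is a hypothetical illustration using only the standard library, not the library's mixing logic; the names samples_per_stream and mixed_epoch, the integer weights, and the sampling-with-replacement policy are all assumptions. Weights are turned into per-dataset sample counts for one epoch, and a seeded generator makes the resulting mix reproducible.

```python
import random


def samples_per_stream(weights, epoch_size):
    """Split epoch_size across streams in proportion to their weights.

    Uses largest-remainder rounding so the counts sum to epoch_size exactly.
    """
    total = sum(weights.values())
    exact = {name: epoch_size * w / total for name, w in weights.items()}
    counts = {name: int(x) for name, x in exact.items()}
    leftover = epoch_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


def mixed_epoch(stream_sizes, weights, epoch_size, seed):
    """Return a reproducible list of (stream_name, sample_id) pairs for one epoch."""
    rng = random.Random(seed)
    counts = samples_per_stream(weights, epoch_size)
    picks = []
    for name in sorted(counts):
        # Sampling with replacement, so a small stream can be upsampled.
        picks.extend((name, rng.randrange(stream_sizes[name])) for _ in range(counts[name]))
    rng.shuffle(picks)
    return picks


sizes = {"web": 1000, "code": 50}
weights = {"web": 7, "code": 3}
epoch = mixed_epoch(sizes, weights, epoch_size=10, seed=0)

assert sum(1 for name, _ in epoch if name == "web") == 7
assert epoch == mixed_epoch(sizes, weights, epoch_size=10, seed=0)
```

Fixing the counts up front, rather than flipping a weighted coin per sample, keeps the realized proportions exact for every epoch.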

Maintenance & Community

The project is actively maintained by the MosaicML team, with contributions welcomed. Community support is available via Slack.

Licensing & Compatibility

The library is released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Existing datasets must first be converted to the MDS format, which can be a significant upfront effort. Many cloud providers are supported, but individual backends may need careful credential and configuration setup.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 6
  • Issues (30d): 0

Star History

  • 73 stars in the last 90 days
