Data streaming library for efficient neural network training
This library provides a high-performance, distributed data streaming solution for training neural networks on large datasets stored in cloud object storage. It's designed for researchers and engineers working with massive datasets, offering seamless integration with PyTorch and cloud providers to maximize training efficiency and scalability.
How It Works
The library converts raw data into a custom sharded format (MDS) optimized for efficient reading. It supports various compression codecs and cloud storage backends (AWS, OCI, GCS, Azure, etc.). A key innovation is its deterministic shuffling and sample ordering, which ensures reproducibility across different numbers of workers and enables instant mid-epoch resumption, significantly reducing downtime and egress costs.
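The idea behind worker-independent ordering and mid-epoch resumption can be illustrated with a small standard-library sketch. This is not the library's actual shuffling algorithm or API. The function names (`epoch_order`, `worker_stream`, `merged_order`) and the round-robin assignment of positions to workers are assumptions made for illustration. The point is that if the global sample order depends only on a seed and the epoch number, then any number of workers can carve it up, and a run can resume by skipping the samples already consumed.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Global sample order for one epoch; depends only on (seed, epoch)."""
    order = list(range(num_samples))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def worker_stream(order, num_workers, rank, samples_seen=0):
    """Samples assigned to one worker, skipping the already-consumed prefix.

    Position i of the global order belongs to worker i % num_workers
    (a hypothetical assignment rule chosen for this sketch).
    """
    return [
        order[i]
        for i in range(samples_seen, len(order))
        if i % num_workers == rank
    ]


def merged_order(order, num_workers, samples_seen=0):
    """Interleave the per-worker streams back into the order training sees."""
    streams = [
        iter(worker_stream(order, num_workers, rank, samples_seen))
        for rank in range(num_workers)
    ]
    return [
        next(streams[i % num_workers])
        for i in range(samples_seen, len(order))
    ]


if __name__ == "__main__":
    order = epoch_order(num_samples=20, seed=17, epoch=0)
    # The same global order is recovered with 2 or 4 workers.
    print(merged_order(order, 2) == merged_order(order, 4))
    # Resuming after 10 samples yields exactly the remaining samples.
    print(merged_order(order, 3, samples_seen=10) == order[10:])
```

In this sketch, resuming only requires remembering how many samples were consumed, so no part of the epoch has to be replayed or downloaded again.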
Quick Start & Requirements
pip install mosaicml-streaming
Highlighted Details
Maintenance & Community
The project is actively maintained by the MosaicML team, with contributions welcomed. Community support is available via Slack.
Licensing & Compatibility
The library is released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The primary requirement is data conversion to the MDS format, which can be a significant upfront effort for existing datasets. While it supports many cloud providers, specific configurations might require careful setup.