streaming by mosaicml

Data streaming library for efficient neural network training

Created 3 years ago

1,440 stars

Top 28.1% on SourcePulse

9 Experts Love This Project

Kaichao You

Core Maintainer of vLLM

Yaowei Zheng

Author of LLaMA-Factory

Jesse Clark

Cofounder of Marqo

John Mullan

MTS at xAI; Cofounder of Hotshot AI

and 5 more!

Project Summary

This library provides a high-performance, distributed data streaming solution for training neural networks on large datasets stored in cloud object storage. It's designed for researchers and engineers working with massive datasets, offering seamless integration with PyTorch and cloud providers to maximize training efficiency and scalability.

How It Works

The library converts raw data into a custom sharded format (MDS) optimized for efficient reading. It supports several compression codecs and cloud storage backends (AWS, OCI, GCS, Azure, and others). Its key innovation is deterministic shuffling and sample ordering: the sample order stays the same regardless of the number of workers, which makes runs reproducible and allows training to resume mid-epoch without replaying the data already consumed. This reduces both downtime and egress costs.
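To make the determinism claim concrete, here is a minimal pure-Python sketch. It is not the library's actual shuffling algorithm; the function names (epoch_order, merged_stream) and the way the seed and epoch are combined are assumptions made for illustration. The idea it shows is that the global order depends only on the seed and epoch, so splitting it across any number of workers, or resuming from a saved position, yields the same stream.

```python
import random


def epoch_order(num_samples: int, seed: int, epoch: int) -> list:
    """Global sample order for one epoch.

    Depends only on (num_samples, seed, epoch), never on the worker count.
    The seed/epoch mixing below is a hypothetical choice for this sketch.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


def worker_slice(order: list, rank: int, world_size: int, start: int = 0) -> list:
    """(position, sample_id) pairs owned by one worker.

    Worker `rank` takes every position p with p % world_size == rank.
    `start` is the number of samples already consumed globally, so
    positions before it are skipped on resumption.
    """
    return [
        (pos, order[pos])
        for pos in range(start, len(order))
        if pos % world_size == rank
    ]


def merged_stream(order: list, world_size: int, start: int = 0) -> list:
    """Reassemble the global stream from all per-worker slices."""
    tagged = []
    for rank in range(world_size):
        tagged.extend(worker_slice(order, rank, world_size, start))
    # Sorting by global position restores the single canonical order.
    return [sample_id for _, sample_id in sorted(tagged)]
```

With this layout, a checkpoint only needs to record the seed, the epoch, and the number of samples consumed so far; a restarted job with a different worker count can pick up at exactly the next sample.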

Quick Start & Requirements

Install with pip: pip install mosaicml-streaming
Requires Python and PyTorch. Cloud storage credentials are needed for remote access.
Data must be converted to MDS format before it can be streamed.
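The conversion and read-back steps might look roughly like the sketch below. It assumes the package exposes MDSWriter and StreamingDataset as in its public documentation; the column names, the toy records, and the helper names (make_samples, write_mds, read_mds, demo) are hypothetical. The library imports sit inside the functions so the file loads even where the package is not installed.

```python
import os
import tempfile


def make_samples(n: int) -> list:
    """Hypothetical toy records; a real dataset would hold images, tokens, etc."""
    return [{"id": i, "text": f"sample {i}"} for i in range(n)]


def write_mds(out_dir: str, samples: list) -> None:
    """Convert in-memory records to MDS shards under out_dir."""
    from streaming import MDSWriter  # requires: pip install mosaicml-streaming

    # Column name -> encoding; these two columns are assumptions for the demo.
    columns = {"id": "int", "text": "str"}
    with MDSWriter(out=out_dir, columns=columns, compression="zstd") as writer:
        for sample in samples:
            writer.write(sample)


def read_mds(local_dir: str) -> list:
    """Read every sample back from a local MDS directory."""
    from streaming import StreamingDataset

    # For cloud data, a remote location would be passed as well and
    # local_dir would act as the on-disk cache.
    dataset = StreamingDataset(local=local_dir, shuffle=False, batch_size=1)
    return [dataset[i] for i in range(len(dataset))]


def demo() -> None:
    """Write ten toy samples to a temporary directory and print two of them."""
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = os.path.join(tmp, "mds")
        write_mds(out_dir, make_samples(10))
        print(read_mds(out_dir)[:2])
```

Calling demo() runs the round trip locally. In training code the dataset object would normally be handed to a PyTorch DataLoader rather than indexed directly.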

Highlighted Details

Supports seamless mixing of multiple datasets with configurable proportions.
Achieves true determinism for reproducible training runs across varying hardware configurations.
Enables instant mid-epoch resumption, saving significant compute time and costs.
Offers high throughput, comparable to or exceeding alternatives like WebDataset and PyTorch's ImageFolder.
Provides random access to any sample by index, even before its shard has been downloaded; the shard is fetched on demand.
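Random access before download can be pictured with a small sketch. This is an illustration of the general technique, not the library's code: the class name LazyShardedDataset, the fetch_shard callback, and the in-memory dictionary standing in for cloud storage are all assumptions. An index of shard sizes is enough to map a sample number to a shard and an offset, so a shard is only fetched the first time one of its samples is requested.

```python
from bisect import bisect_right
from itertools import accumulate


class LazyShardedDataset:
    """Index-addressable dataset that fetches shards only when needed."""

    def __init__(self, shard_sizes, fetch_shard):
        # Cumulative end offset of each shard, e.g. [3, 2, 4] -> [3, 5, 9].
        self.ends = list(accumulate(shard_sizes))
        self.fetch_shard = fetch_shard  # callback: shard id -> list of samples
        self.cache = {}                 # shard id -> samples already fetched

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def locate(self, index):
        """Map a global sample index to (shard id, offset within shard)."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        shard = bisect_right(self.ends, index)
        shard_start = self.ends[shard - 1] if shard else 0
        return shard, index - shard_start

    def __getitem__(self, index):
        shard, offset = self.locate(index)
        if shard not in self.cache:
            # First touch of this shard: fetch it, then serve from the cache.
            self.cache[shard] = self.fetch_shard(shard)
        return self.cache[shard][offset]


# Stand-in for remote object storage (hypothetical data).
remote = {0: ["a", "b", "c"], 1: ["d", "e"], 2: ["f", "g", "h", "i"]}
downloads = []  # records which shards were actually fetched


def fetch(shard_id):
    downloads.append(shard_id)
    return remote[shard_id]


dataset = LazyShardedDataset([3, 2, 4], fetch)
```

Reading dataset[4] here touches only shard 1; shards 0 and 2 stay remote until one of their samples is requested.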

Maintenance & Community

The project is actively maintained by the MosaicML team, with contributions welcomed. Community support is available via Slack.

Licensing & Compatibility

The library is released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Existing datasets must first be converted to the MDS format, which can be a significant upfront effort. Many cloud providers are supported, but each one needs its credentials and configuration set up correctly before remote streaming works.

Health Check

Last Commit

2 months ago

Responsiveness

1 day

Star History

12 stars in the last 30 days