litData by Lightning-AI

SDK for scaling data transforms and optimizing datasets for fast AI training

Created 1 year ago

568 stars

Top 56.6% on SourcePulse

3 Experts Love This Project

Luis Capelo

Cofounder of Lightning AI

Jeff Hammerbacher

Cofounder of Cloudera

Luca Antiga

CTO of Lightning AI

Project Summary

LitData is an open-source Python library designed to accelerate AI model training by optimizing and streaming datasets. It targets ML engineers and researchers working with large datasets, enabling faster data processing, efficient cloud data utilization, and seamless integration with popular ML frameworks like PyTorch Lightning.

How It Works

LitData offers two primary modes: optimize, which transforms a dataset into an efficient chunked binary format, and map, which parallelizes data processing tasks across multiple machines. Optimized data loads up to 20x faster than non-optimized data because it can be streamed directly from cloud storage (S3, GCS, Azure) without local downloads. The approach combines parallel processing with efficient data serialization to minimize I/O bottlenecks during training.
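To make the idea of a chunked binary format concrete, here is a toy sketch in plain Python. It is not LitData's actual on-disk format: the file names, header layout, and the helpers write_chunks and read_sample are all hypothetical, chosen only to show why size-capped chunks plus an index let a reader fetch one sample without scanning the whole dataset.

```python
import json
import struct
from pathlib import Path


def write_chunks(samples, out_dir, chunk_bytes=64):
    """Pack byte samples into size-capped chunk files and write an index.

    Toy layout (an assumption for illustration, not LitData's format):
    each chunk holds a sample count, count + 1 byte offsets, then the data.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    index = []            # one entry per chunk: file name and sample count
    buffer, size = [], 0  # samples waiting to be written, and their total bytes

    def flush():
        nonlocal buffer, size
        if not buffer:
            return
        name = f"chunk-{len(index)}.bin"
        offsets = [0]
        for s in buffer:
            offsets.append(offsets[-1] + len(s))
        with open(out_dir / name, "wb") as f:
            f.write(struct.pack("<I", len(buffer)))
            f.write(struct.pack(f"<{len(offsets)}I", *offsets))
            f.write(b"".join(buffer))
        index.append({"file": name, "samples": len(buffer)})
        buffer, size = [], 0

    for s in samples:
        # Start a new chunk once adding this sample would exceed the cap.
        if buffer and size + len(s) > chunk_bytes:
            flush()
        buffer.append(s)
        size += len(s)
    flush()
    (out_dir / "index.json").write_text(json.dumps(index))
    return index


def read_sample(out_dir, i):
    """Fetch sample i by opening only the chunk that contains it."""
    if i < 0:
        raise IndexError("sample index out of range")
    out_dir = Path(out_dir)
    index = json.loads((out_dir / "index.json").read_text())
    for entry in index:
        if i < entry["samples"]:
            with open(out_dir / entry["file"], "rb") as f:
                (count,) = struct.unpack("<I", f.read(4))
                offsets = struct.unpack(f"<{count + 1}I", f.read(4 * (count + 1)))
                data_start = 4 + 4 * (count + 1)
                f.seek(data_start + offsets[i])
                return f.read(offsets[i + 1] - offsets[i])
        i -= entry["samples"]
    raise IndexError("sample index out of range")
```

Because the index records how many samples each chunk holds, a reader needs only the small index file and one chunk to serve any sample, which is the property that makes streaming from remote storage practical.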

Quick Start & Requirements

Install: pip install litdata or pip install 'litdata[extras]' for all features.
Prerequisites: Python, cloud storage access (S3, GCS, Azure), optional s5cmd for S3.
Links: Quick start, Docs, Benchmarks, Templates.
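After installation, a first run might look like the sketch below. It follows the optimize and StreamingDataset usage pattern from the project's README as best understood here; the sample contents, the "fast_data" output directory, and the helper names make_sample and build_and_stream are illustrative assumptions, and exact parameters should be checked against the current docs.

```python
def make_sample(index):
    """Build one training sample from an input item (hypothetical content)."""
    return {"index": index, "text": f"sample {index}", "label": index % 10}


def build_and_stream(output_dir="fast_data"):
    """Optimize a small synthetic dataset, then stream it back in batches."""
    # Imported here so the rest of the module loads without litdata installed.
    import litdata as ld

    # Convert the inputs into chunked, streamable files under output_dir.
    ld.optimize(
        fn=make_sample,
        inputs=list(range(1000)),
        output_dir=output_dir,
        num_workers=4,
        chunk_bytes="64MB",
    )

    # Stream the optimized data; a cloud URI such as an S3 path could be
    # passed instead of a local directory.
    dataset = ld.StreamingDataset(output_dir, shuffle=True)
    return ld.StreamingDataLoader(dataset, batch_size=32)
```

Since optimize starts worker processes, call build_and_stream() from under an `if __name__ == "__main__":` guard when running this as a script.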

Highlighted Details

Streams up to 20x faster than non-optimized data and up to 2x faster than other streaming solutions such as WebDataset and MosaicML.
Optimizes datasets 3-5x faster than competing frameworks.
Supports direct streaming of Hugging Face datasets and Parquet files.
Enables distributed data processing and optimization across multiple nodes.
Features include resumable streaming, data encryption, dataset merging/splitting, and flexible caching.
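The distributed processing item above refers to map mode. A minimal sketch follows, assuming the README convention that the worker function receives one input and the output directory; uppercase_file and run_map are hypothetical names, and the text transform is a stand-in for real work such as resizing images.

```python
import os


def uppercase_file(input_path, output_dir):
    """Worker: read one text file and write an upper-cased copy to output_dir."""
    with open(input_path, encoding="utf-8") as f:
        text = f.read()
    out_path = os.path.join(output_dir, os.path.basename(input_path))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text.upper())


def run_map(input_paths, output_dir, num_workers=2):
    """Apply uppercase_file to every input in parallel via litdata's map."""
    # Imported here so the worker above can be used without litdata installed.
    from litdata import map as ld_map

    ld_map(
        fn=uppercase_file,
        inputs=input_paths,
        output_dir=output_dir,
        num_workers=num_workers,
    )
```

Keeping the worker a plain top-level function of (input, output_dir) means it can be tested on a single file before being handed to map.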

Maintenance & Community

LitData is an active community project with maintainers from Lightning AI. Support and discussion are available via their Discord server.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: The permissive license allows use in commercial and closed-source applications.

Limitations & Caveats

Hugging Face dataset streaming is currently limited to datasets in Parquet format.
The streaming speedups depend on first running optimize to convert data into LitData's chunked binary format, which adds an initial conversion step.

Health Check

Last Commit

3 days ago

Responsiveness

1 day

Star History

4 stars in the last 30 days