SDK for scaling data transforms and optimizing datasets for fast AI training
LitData is an open-source Python library designed to accelerate AI model training by optimizing and streaming datasets. It targets ML engineers and researchers working with large datasets, enabling faster data processing, efficient cloud data utilization, and seamless integration with popular ML frameworks like PyTorch Lightning.
How It Works
LitData offers two primary modes: optimize, for transforming datasets into a highly efficient, chunked binary format, and map, for parallelizing data processing tasks across multiple machines. The optimize process significantly speeds up data loading (up to 20x faster than non-optimized data) by preparing data for direct streaming from cloud storage (S3, GCS, Azure) without local downloads. This approach leverages parallel processing and efficient data serialization to minimize I/O bottlenecks during training.
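To make the idea of a chunked binary format concrete, here is a minimal standard-library sketch. It is not LitData's actual on-disk layout or API: the function names write_chunks and read_sample, the index.json file, and the use of pickle are all assumptions chosen for illustration. It shows only the general technique of packing serialized samples into fixed-budget chunk files and recording byte offsets so that a single sample can be read without loading the whole dataset.

```python
import json
import os
import pickle
import tempfile


def write_chunks(samples, output_dir, chunk_bytes=1024):
    """Pack pickled samples into chunk files and write a JSON index.

    Hypothetical helper: the layout is illustrative, not LitData's format.
    """
    os.makedirs(output_dir, exist_ok=True)
    index = []                      # one entry per chunk file
    buffer, spans = bytearray(), []  # pending bytes and (offset, length) pairs

    def flush():
        # Write the pending buffer as one chunk file and record its spans.
        if not spans:
            return
        name = f"chunk-{len(index)}.bin"
        with open(os.path.join(output_dir, name), "wb") as f:
            f.write(buffer)
        index.append({"file": name, "spans": list(spans)})
        buffer.clear()
        spans.clear()

    for sample in samples:
        blob = pickle.dumps(sample)
        # Start a new chunk when this sample would exceed the byte budget.
        if buffer and len(buffer) + len(blob) > chunk_bytes:
            flush()
        spans.append([len(buffer), len(blob)])
        buffer.extend(blob)
    flush()

    with open(os.path.join(output_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index


def read_sample(output_dir, i):
    """Read sample i by seeking into the one chunk that contains it."""
    if i < 0:
        raise IndexError("sample index out of range")
    with open(os.path.join(output_dir, "index.json")) as f:
        index = json.load(f)
    for chunk in index:
        if i < len(chunk["spans"]):
            offset, length = chunk["spans"][i]
            with open(os.path.join(output_dir, chunk["file"]), "rb") as f:
                f.seek(offset)
                return pickle.loads(f.read(length))
        i -= len(chunk["spans"])
    raise IndexError("sample index out of range")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = [{"index": n, "payload": bytes(100)} for n in range(50)]
        chunks = write_chunks(data, tmp, chunk_bytes=1024)
        print(len(chunks), "chunks written")
        print(read_sample(tmp, 37)["index"])
```

Because each read touches only the index and one chunk, the same pattern extends naturally to remote storage, where a chunk can be fetched on demand instead of downloading the full dataset first.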
Quick Start & Requirements
Install with pip install litdata, or pip install 'litdata[extras]' for all features.
s5cmd for S3.
Highlighted Details
Maintenance & Community
LitData is an active community project with maintainers from Lightning AI. Support and discussion are available via their Discord server.
Licensing & Compatibility
Limitations & Caveats
The optimize function requires data to be processed into a specific chunked binary format, which may involve an initial conversion step.