litData by Lightning-AI

SDK for scaling data transforms and optimizing datasets for fast AI training

Created 1 year ago
568 stars

Top 56.6% on SourcePulse

Project Summary

LitData is an open-source Python library designed to accelerate AI model training by optimizing and streaming datasets. It targets ML engineers and researchers working with large datasets, enabling faster data processing, efficient cloud data utilization, and seamless integration with popular ML frameworks like PyTorch Lightning.

How It Works

LitData offers two primary functions: optimize, which converts a dataset into a chunked binary format built for fast loading, and map, which parallelizes data processing tasks across multiple machines. According to the project, optimized data loads up to 20x faster than non-optimized data because it can be streamed directly from cloud storage (S3, GCS, Azure) without first being downloaded locally. Parallel processing and efficient data serialization keep I/O from becoming the bottleneck during training.
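
The optimize-then-stream flow described above might look like the following minimal sketch. The sample fields, the "fast_data" output directory, and the helper names make_sample and build_and_stream are illustrative assumptions; only optimize, StreamingDataset, and StreamingDataLoader come from the library, and the argument values are placeholders rather than recommended settings.

```python
def make_sample(index: int) -> dict:
    """Build one sample; optimize calls this once per entry in `inputs`."""
    return {"index": index, "text": f"sample {index}", "label": index % 10}


def build_and_stream(output_dir: str = "fast_data") -> None:
    """Write an optimized dataset, then read one batch back from it."""
    # Imported lazily so make_sample can be tried without litdata installed.
    import litdata as ld

    # Convert the samples into the chunked format (assumed settings).
    ld.optimize(
        fn=make_sample,
        inputs=list(range(1000)),
        output_dir=output_dir,
        num_workers=4,
        chunk_bytes="64MB",
    )

    # Stream the optimized data; output_dir could also be a cloud URI.
    dataset = ld.StreamingDataset(output_dir, shuffle=True)
    loader = ld.StreamingDataLoader(dataset, batch_size=32)
    for batch in loader:
        print(batch["index"])
        break
```

Because optimize starts worker processes, build_and_stream() is best called from under an `if __name__ == "__main__":` guard in a script.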

Quick Start & Requirements

  • Install: pip install litdata or pip install 'litdata[extras]' for all features.
  • Prerequisites: Python, cloud storage access (S3, GCS, Azure), optional s5cmd for S3.
  • Links: Quick start, Docs, Benchmarks, Templates.

Highlighted Details

  • Reports up to 20x faster streaming than non-optimized data and 2x faster streaming than other solutions such as WebDataset and MosaicML.
  • Reports optimizing datasets 3-5x faster than competing frameworks.
  • Supports direct streaming of Hugging Face datasets and Parquet files.
  • Enables distributed data processing and optimization across multiple nodes.
  • Features include resumable streaming, data encryption, dataset merging/splitting, and flexible caching.
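
The map mode described under How It Works, which underlies the distributed processing mentioned in the list above, could be sketched as follows. The uppercase_file transform and the run_map wrapper are hypothetical names chosen for illustration; the sketch assumes that map passes each input together with the output directory to the function, and it runs on a single machine rather than across nodes.

```python
import os


def uppercase_file(input_path: str, output_dir: str) -> None:
    """Read one text file and write an upper-cased copy into output_dir."""
    with open(input_path, encoding="utf-8") as src:
        text = src.read()
    out_path = os.path.join(output_dir, os.path.basename(input_path))
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(text.upper())


def run_map(input_paths: list, output_dir: str) -> None:
    """Apply uppercase_file to every input path in parallel workers."""
    # Imported lazily so uppercase_file can be tried without litdata installed.
    from litdata import map as ld_map

    ld_map(
        fn=uppercase_file,
        inputs=input_paths,
        output_dir=output_dir,
        num_workers=2,  # assumed value; tune to the machine
    )
```

Keeping the per-item transform free of any LitData dependency makes it easy to check on a single file before handing it to map.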

Maintenance & Community

LitData is an active community project with maintainers from Lightning AI. Support and discussion are available via their Discord server.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Compatible with commercial and closed-source applications.

Limitations & Caveats

  • Hugging Face dataset streaming is currently limited to datasets in Parquet format.
  • Fast streaming requires first converting the data with optimize into LitData's chunked binary format, which adds an upfront conversion step.
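
To make the idea of a chunked binary format concrete, here is a standard-library sketch that packs byte samples into fixed-budget chunk files plus an index, then reads a single sample back by position. It is a simplified illustration of the general technique, not LitData's actual on-disk layout; the file names, the index structure, and the functions write_chunks and read_sample are all assumptions.

```python
import json
import os


def write_chunks(samples, out_dir, chunk_bytes=64):
    """Pack byte samples into chunk files and write a JSON index.

    Returns the number of chunk files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    index = []  # one [chunk_id, offset, length] entry per sample
    chunk_id, buf = 0, bytearray()

    def flush():
        nonlocal chunk_id, buf
        if buf:
            path = os.path.join(out_dir, f"chunk-{chunk_id}.bin")
            with open(path, "wb") as f:
                f.write(buf)
            chunk_id += 1
            buf = bytearray()

    for sample in samples:
        # Start a new chunk when this sample would exceed the byte budget.
        if buf and len(buf) + len(sample) > chunk_bytes:
            flush()
        index.append([chunk_id, len(buf), len(sample)])
        buf += sample
    flush()

    with open(os.path.join(out_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return chunk_id


def read_sample(out_dir, i):
    """Fetch sample i by seeking into its chunk; other chunks stay untouched."""
    with open(os.path.join(out_dir, "index.json")) as f:
        chunk_id, offset, length = json.load(f)[i]
    with open(os.path.join(out_dir, f"chunk-{chunk_id}.bin"), "rb") as f:
        f.seek(offset)
        return f.read(length)
```

The index is what makes streaming practical: a reader only needs to fetch the chunk that holds the requested sample, which is why the one-time conversion pays off during training.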

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 9
  • Issues (30d): 3
  • Star history: 4 stars in the last 30 days
