litData by Lightning-AI

SDK for scaling data transforms and optimizing datasets for fast AI training

created 1 year ago
516 stars

Top 61.6% on sourcepulse

Project Summary

LitData is an open-source Python library designed to accelerate AI model training by optimizing and streaming datasets. It targets ML engineers and researchers working with large datasets, enabling faster data processing, efficient cloud data utilization, and seamless integration with popular ML frameworks like PyTorch Lightning.

How It Works

LitData offers two primary modes: optimize, which transforms datasets into an efficient, chunked binary format, and map, which parallelizes data processing tasks across multiple machines. Optimized data loads up to 20x faster than non-optimized data because it can be streamed directly from cloud storage (S3, GCS, Azure) without first being downloaded locally. The approach relies on parallel processing and efficient data serialization to minimize I/O bottlenecks during training.
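
The idea of a chunked binary format can be illustrated with a minimal, standard-library-only sketch. This is not LitData's actual on-disk format or API: the function names write_chunks and read_sample, the use of pickle for serialization, and the tiny 1 KB chunk size are all assumptions chosen for illustration. The sketch only shows why packing serialized samples into chunks with an index lets a reader fetch one sample by touching a single chunk.

```python
import io
import pickle


def write_chunks(samples, chunk_bytes=1024):
    """Pack pickled samples into chunks of at most chunk_bytes each.

    Returns (chunks, index), where index[i] = (chunk_id, offset, length).
    Hypothetical helper for illustration; not part of LitData.
    """
    chunks, index = [], []
    buf = io.BytesIO()
    for sample in samples:
        payload = pickle.dumps(sample)
        # Start a new chunk when adding this sample would exceed the budget.
        if buf.tell() > 0 and buf.tell() + len(payload) > chunk_bytes:
            chunks.append(buf.getvalue())
            buf = io.BytesIO()
        index.append((len(chunks), buf.tell(), len(payload)))
        buf.write(payload)
    if buf.tell() > 0:
        chunks.append(buf.getvalue())
    return chunks, index


def read_sample(chunks, index, i):
    """Fetch sample i by slicing only the chunk that holds it."""
    chunk_id, offset, length = index[i]
    return pickle.loads(chunks[chunk_id][offset:offset + length])


samples = [{"id": i, "text": "x" * 100} for i in range(50)]
chunks, index = write_chunks(samples, chunk_bytes=1024)
print(len(chunks), "chunks;", "sample 37 ->", read_sample(chunks, index, 37)["id"])
```

In a real streaming setup each chunk would be a separate object in cloud storage, so a training job could download only the chunks it needs, in the order it needs them.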

Quick Start & Requirements

  • Install: pip install litdata or pip install 'litdata[extras]' for all features.
  • Prerequisites: Python, cloud storage access (S3, GCS, Azure), optional s5cmd for S3.
  • Links: Quick start, Docs, Benchmarks, Templates.

Highlighted Details

  • Streams data up to 20x faster than non-optimized data and 2x faster than other streaming solutions such as WebDataset and MosaicML Streaming.
  • Optimizes datasets 3-5x faster than competing frameworks.
  • Supports direct streaming of Hugging Face datasets and Parquet files.
  • Enables distributed data processing and optimization across multiple nodes.
  • Features include resumable streaming, data encryption, dataset merging/splitting, and flexible caching.
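
Resumable streaming, mentioned in the list above, can be sketched in a few lines. The class below is a hypothetical illustration, not LitData's implementation: the name ResumableStream is invented here, and the state_dict / load_state_dict method names are an assumption borrowed from the PyTorch checkpointing convention. It shows the core idea of saving a read position and restoring it in a fresh iterator so that training can continue where it stopped.

```python
class ResumableStream:
    """Iterates over items and can checkpoint and restore its position.

    Hypothetical class for illustration; not part of LitData.
    """

    def __init__(self, items):
        self._items = list(items)
        self._position = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._position >= len(self._items):
            raise StopIteration
        item = self._items[self._position]
        self._position += 1
        return item

    def state_dict(self):
        # Everything needed to resume: here, just the read position.
        return {"position": self._position}

    def load_state_dict(self, state):
        self._position = state["position"]


# Consume part of the stream, then checkpoint.
stream = ResumableStream(range(10))
first = [next(stream) for _ in range(4)]
checkpoint = stream.state_dict()

# A fresh stream restored from the checkpoint continues where the first stopped.
resumed = ResumableStream(range(10))
resumed.load_state_dict(checkpoint)
rest = list(resumed)
```

A production loader would also need to record shuffling seeds and per-worker positions; this sketch tracks a single position only.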

Maintenance & Community

LitData is an active community project with maintainers from Lightning AI. Support and discussion are available via their Discord server.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Suitable for use in commercial and closed-source applications.

Limitations & Caveats

  • Hugging Face dataset streaming is currently limited to datasets in Parquet format.
  • Benefiting from optimize requires converting data into its chunked binary format, which may add an initial conversion step.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 18
  • Issues (30d): 14
  • Star history: 55 stars in the last 90 days
