grain by google

Python library for ML training data pipelines

Created 3 years ago
536 stars

Top 59.4% on SourcePulse

Project Summary

Grain is a Python library for efficient, deterministic, and flexible reading and processing of machine learning training data, primarily targeting JAX models but usable with other frameworks. It enables users to define complex data pipelines declaratively, simplifying the preparation of datasets for training and evaluation.

How It Works

Grain employs a declarative API for defining data processing pipelines. Users chain transformations like shuffle, map, and batch to construct a data flow. This approach allows for clear, readable pipeline definitions and enables Grain to optimize the execution of these steps, ensuring deterministic and efficient data handling.
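
The chaining style described above can be sketched with the standard library alone. This is a minimal illustration, not Grain's API: the `SketchPipeline` class, its method names, and `build_pipeline` are hypothetical stand-ins that mirror the shuffle, map, and batch steps named in the summary. The sketch is eager and in-memory, and it shows only why a seeded shuffle makes the whole pipeline reproducible.

```python
import random
from typing import Any, Callable, Iterator, List, Sequence


class SketchPipeline:
    """Hypothetical chainable pipeline; a stand-in, not Grain's API."""

    def __init__(self, records: Sequence[Any]):
        self._records = list(records)

    def shuffle(self, seed: int) -> "SketchPipeline":
        # A dedicated seeded RNG yields the same permutation on every run.
        rng = random.Random(seed)
        shuffled = list(self._records)
        rng.shuffle(shuffled)
        return SketchPipeline(shuffled)

    def map(self, fn: Callable[[Any], Any]) -> "SketchPipeline":
        # Apply fn to each record independently (element-wise).
        return SketchPipeline([fn(r) for r in self._records])

    def batch(self, batch_size: int) -> "SketchPipeline":
        # Group consecutive records; the last batch may be smaller.
        batches = [
            self._records[i:i + batch_size]
            for i in range(0, len(self._records), batch_size)
        ]
        return SketchPipeline(batches)

    def __iter__(self) -> Iterator[Any]:
        return iter(self._records)


def build_pipeline(seed: int) -> List[List[int]]:
    """Chain shuffle -> map -> batch over ten integer records."""
    return list(
        SketchPipeline(range(10))
        .shuffle(seed=seed)
        .map(lambda x: x + 1)
        .batch(batch_size=4)
    )


if __name__ == "__main__":
    for batch in build_pipeline(seed=42):
        print(batch)
```

Because the only source of randomness is the seeded shuffle, calling `build_pipeline` twice with the same seed produces identical batches, which is the property the summary refers to as determinism.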

Quick Start & Requirements

Highlighted Details

  • Designed for JAX models but framework-agnostic.
  • Supports global shuffling and element-wise mapping.
  • Used internally by Google projects like MaxText and Gemma.

Maintenance & Community

  • Developed by Google.
  • No explicit community links (Discord/Slack) or roadmap provided in the README.

Licensing & Compatibility

  • Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Grain does not run its transformations on GPUs or TPUs, so all data processing executes on the CPU. Windows is not supported.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 48
  • Issues (30d): 7
  • Star history: 33 stars in the last 30 days
