grain by google

Python library for ML training data pipelines

created 3 years ago
486 stars

Top 64.2% on sourcepulse

Project Summary

Grain is a Python library for efficient, deterministic, and flexible reading and processing of machine learning training data, primarily targeting JAX models but usable with other frameworks. It enables users to define complex data pipelines declaratively, simplifying the preparation of datasets for training and evaluation.

How It Works

Grain employs a declarative API for defining data processing pipelines. Users chain transformations like shuffle, map, and batch to construct a data flow. This approach allows for clear, readable pipeline definitions and enables Grain to optimize the execution of these steps, ensuring deterministic and efficient data handling.
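The chaining idea described above can be illustrated with a minimal sketch in plain Python. This is not Grain's API: `ToyPipeline` and its methods are hypothetical names invented here, and the sketch only assumes that a seeded shuffle is what makes repeated runs reproducible. Steps are recorded when chained and executed only when `run()` is called.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

Step = Callable[[List[Any]], List[Any]]


class ToyPipeline:
    """Hypothetical stand-in for a declarative pipeline (not Grain's API)."""

    def __init__(self, source: Sequence[Any], steps: Tuple[Step, ...] = ()):
        self._source = source
        self._steps = steps

    def _with(self, step: Step) -> "ToyPipeline":
        # Each call returns a new pipeline; nothing is executed yet.
        return ToyPipeline(self._source, self._steps + (step,))

    def shuffle(self, seed: int) -> "ToyPipeline":
        def step(items: List[Any]) -> List[Any]:
            items = list(items)
            # A fresh, seeded generator gives the same order on every run.
            random.Random(seed).shuffle(items)
            return items
        return self._with(step)

    def map(self, fn: Callable[[Any], Any]) -> "ToyPipeline":
        return self._with(lambda items: [fn(x) for x in items])

    def batch(self, batch_size: int) -> "ToyPipeline":
        def step(items: List[Any]) -> List[Any]:
            return [items[i:i + batch_size]
                    for i in range(0, len(items), batch_size)]
        return self._with(step)

    def run(self) -> List[Any]:
        # Apply the recorded steps in the order they were chained.
        items = list(self._source)
        for step in self._steps:
            items = step(items)
        return items


# Declare the pipeline: shuffle, then element-wise map, then batch.
pipeline = (
    ToyPipeline(range(10))
    .shuffle(seed=42)
    .map(lambda x: x + 1)
    .batch(batch_size=2)
)
batches = pipeline.run()
```

Because the definition is separate from execution, calling `pipeline.run()` twice yields identical batches in this sketch, which is the property the summary refers to as deterministic.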

Quick Start & Requirements

Highlighted Details

  • Designed for JAX models but framework-agnostic.
  • Supports global shuffling and element-wise mapping.
  • Used internally by Google projects like MaxText and Gemma.

Maintenance & Community

  • Developed by Google.
  • No explicit community links (Discord/Slack) or roadmap provided in the README.

Licensing & Compatibility

  • Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Grain does not run its transformations on GPUs or TPUs, so all data processing happens on the CPU. Windows is not a supported platform.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 41
  • Issues (30d): 3
  • Star history: 61 stars in the last 90 days
