grain by google

Python library for ML training data pipelines

Created 3 years ago

684 stars

Top 49.6% on SourcePulse

4 Experts Love This Project

wesm: Author of Pandas
Edward-Sun: Research Scientist at Meta Superintelligence Lab
lucidrains: Prolific Research Paper Implementer
hammer (Jeff Hammerbacher): Cofounder of Cloudera

Project Summary

Grain is a Python library for efficient, deterministic, and flexible reading and processing of machine learning training data, primarily targeting JAX models but usable with other frameworks. It enables users to define complex data pipelines declaratively, simplifying the preparation of datasets for training and evaluation.

How It Works

Grain employs a declarative API for defining data processing pipelines. Users chain transformations like shuffle, map, and batch to construct a data flow. This approach allows for clear, readable pipeline definitions and enables Grain to optimize the execution of these steps, ensuring deterministic and efficient data handling.
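The chaining idea can be illustrated with a small, self-contained sketch. This is not Grain's API or implementation: `ToyPipeline` and `build_batches` are hypothetical names invented for illustration, and the sketch evaluates every step eagerly in memory. It only shows how a seeded shuffle followed by map and batch yields the same batches on every run.

```python
import random
from typing import Any, Callable, Iterator, List, Sequence


class ToyPipeline:
    """Hypothetical stand-in for a declarative pipeline (not Grain's API)."""

    def __init__(self, records: Sequence[Any]) -> None:
        self._records: List[Any] = list(records)

    def shuffle(self, seed: int) -> "ToyPipeline":
        # Seeded shuffle over the whole dataset: same seed, same order.
        shuffled = list(self._records)
        random.Random(seed).shuffle(shuffled)
        return ToyPipeline(shuffled)

    def map(self, fn: Callable[[Any], Any]) -> "ToyPipeline":
        # Apply fn to each element independently.
        return ToyPipeline([fn(record) for record in self._records])

    def batch(self, batch_size: int) -> "ToyPipeline":
        # Group consecutive elements into lists of at most batch_size.
        batches = [
            self._records[i:i + batch_size]
            for i in range(0, len(self._records), batch_size)
        ]
        return ToyPipeline(batches)

    def __iter__(self) -> Iterator[Any]:
        return iter(self._records)


def build_batches(seed: int) -> List[List[int]]:
    # Each step returns a new pipeline, so the chain reads top to bottom.
    pipeline = (
        ToyPipeline(range(8))
        .shuffle(seed=seed)
        .map(lambda x: x + 1)
        .batch(batch_size=2)
    )
    return list(pipeline)


first_run = build_batches(seed=42)
second_run = build_batches(seed=42)
```

Because the only source of randomness is the explicit seed, `first_run` and `second_run` hold identical batches. A real library would typically avoid materializing each intermediate list as this sketch does.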

Quick Start & Requirements

Install with: pip install grain
Supports Linux (x86_64, aarch64) and macOS (aarch64). Windows is not supported.
Data processing occurs on the CPU.
Quickstart guide: https://github.com/google/grain#quickstart
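The platform list above can be turned into a quick pre-install check. The following is a minimal sketch using only the standard library; `is_supported_platform` is a hypothetical helper, not part of Grain, and it assumes that Apple Silicon machines report their architecture as `arm64`, which is treated here as equivalent to `aarch64`.

```python
import platform

# Supported (OS, architecture) pairs, taken from the list above.
SUPPORTED = {
    ("Linux", "x86_64"),
    ("Linux", "aarch64"),
    ("Darwin", "aarch64"),  # macOS
}


def is_supported_platform(system: str, machine: str) -> bool:
    """Return True if the (system, machine) pair appears in the supported list."""
    # Assumption: macOS on Apple Silicon reports "arm64" rather than "aarch64".
    normalized = "aarch64" if machine.lower() == "arm64" else machine.lower()
    return (system, normalized) in SUPPORTED


if __name__ == "__main__":
    ok = is_supported_platform(platform.system(), platform.machine())
    print("supported" if ok else "not supported")
```

Keeping the check as a pure function of two strings makes it easy to reuse in a setup script or CI job without depending on the machine it runs on.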

Highlighted Details

Designed for JAX models but framework-agnostic.
Supports global shuffling and element-wise mapping.
Used internally by Google projects like MaxText and Gemma.

Maintenance & Community

Developed by Google.
No explicit community links (Discord/Slack) or roadmap provided in the README.

Licensing & Compatibility

Apache License 2.0.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The library does not use GPUs or TPUs for its transformations, so all data processing runs on the CPU. Windows is not a supported platform.

Health Check

Last Commit: 1 day ago
Responsiveness: 1 week
Pull Requests (30d): 24
Issues (30d): 1

Star History

11 stars in the last 30 days
