croissant by mlcommons

Metadata format for ML datasets (research paper)

Created 2 years ago

796 stars

Top 44.2% on SourcePulse

4 Experts Love This Project

Simon Willison

Coauthor of Django

Jeff Hammerbacher

Cofounder of Cloudera

Omar Sanseviero

DevRel at Google DeepMind

Julien Chaumond

Cofounder of Hugging Face

Project Summary

Croissant is a high-level metadata format designed to standardize the description and accessibility of machine learning datasets. It aims to simplify dataset discovery, usage, and tool support by integrating metadata, resource descriptions, data structure, and ML semantics into a single, web-friendly file, building upon schema.org.

How It Works

Croissant defines a JSON-LD structure that encapsulates four key layers: Metadata (dataset description, responsible AI aspects), Resources (raw data file locations and properties), Structure (how data is organized and fields are defined), and ML Semantics (common ML usage patterns). This layered approach, grounded in schema.org, allows for rich, machine-readable descriptions that enhance discoverability and interoperability across ML tools and platforms.
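As a rough illustration of the four layers, the sketch below assembles a heavily simplified Croissant-style JSON-LD document in Python. The dataset name, URL, file name, and field names are hypothetical, the `@context` is abbreviated, and the result has not been validated against the Croissant specification. Treat property names such as `distribution`, `recordSet`, `field`, and the `cr:Label` type as a reading of the published format, not a complete reference.

```python
import json

# Hypothetical example values: none of these names or URLs describe a real dataset.
croissant_doc = {
    # Abbreviated context; a real Croissant file declares many more terms.
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    # Layer 1, Metadata: schema.org Dataset properties.
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "Hypothetical product reviews with sentiment labels.",
    "license": "https://www.apache.org/licenses/LICENSE-2.0",
    # Layer 2, Resources: where the raw files live and what format they use.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/data/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Layer 3, Structure: record sets and the fields extracted from resources.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/sentiment",
                    # Layer 4, ML Semantics: marking a field as a label.
                    # Assumption: "cr:Label" is used this way; check the spec.
                    "dataType": ["sc:Text", "cr:Label"],
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "sentiment"},
                    },
                },
            ],
        }
    ],
}


def _types(field):
    """Return a field's dataType as a list, whether stored as str or list."""
    data_type = field.get("dataType", [])
    return [data_type] if isinstance(data_type, str) else list(data_type)


def layer_summary(doc):
    """Map each of the four layers to the parts of the document that carry it."""
    fields = [f for rs in doc["recordSet"] for f in rs["field"]]
    return {
        "metadata": doc["name"],
        "resources": [r["@id"] for r in doc["distribution"]],
        "structure": [f["@id"] for f in fields],
        "ml_semantics": [f["@id"] for f in fields if "cr:Label" in _types(f)],
    }


print(json.dumps(layer_summary(croissant_doc), indent=2))
```

Keeping all four layers in one JSON-LD object is what lets a single file serve both web crawlers (through the schema.org properties) and ML loaders (through the record sets and fields).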

Quick Start & Requirements

Installation: pip install mlcroissant
Requirements: Python 3.10+
Example Usage: Load and inspect a dataset with the mlcroissant library, or integrate with tensorflow_datasets using CroissantBuilder.
Documentation: https://huggingface.co/docs/croissant/index
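The loading step might look roughly like the sketch below, which wraps `mlcroissant` in a small helper. The helper name `preview_records` and its parameters are hypothetical. The calls `mlc.Dataset(jsonld=...)` and `.records(record_set=...)` reflect the library's documented usage as best understood here and should be checked against the installed version. The record set name depends on the dataset, so it is passed in, not assumed.

```python
from itertools import islice


def preview_records(jsonld, record_set, limit=3):
    """Return up to `limit` records from one record set of a Croissant dataset.

    `jsonld` is a path or URL to a Croissant JSON-LD file; `record_set` is the
    name of a record set declared in that file.
    """
    # Imported lazily so this module can be loaded without mlcroissant installed.
    import mlcroissant as mlc

    dataset = mlc.Dataset(jsonld=jsonld)
    # records() yields records lazily, so islice stops after `limit` items
    # without iterating over the whole dataset.
    return list(islice(dataset.records(record_set=record_set), limit))
```

Calling `preview_records` with the location of a Croissant file and one of its record set names returns a short list of records for inspection. The call needs `mlcroissant` installed and, for remote files, network access.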

Highlighted Details

Built on schema.org's Dataset vocabulary for web discoverability.
Integrations with Hugging Face, Kaggle, OpenML, and TensorFlow Datasets (TFDS).
Supports responsible AI metadata.
Community-driven development under MLCommons.

Maintenance & Community

Croissant is a community-driven project under the MLCommons Association Datasets Working Group. Key contributors include individuals from Hugging Face, Google, Meta, Kaggle, and academic institutions. Community engagement is encouraged via mailing lists and issue tracking.

Licensing & Compatibility

The Croissant project code and examples are licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Croissant is under active development: the core format is stable, but specific tool integrations and advanced features may still be evolving. The README does not document performance benchmarks or known limitations beyond this development status.

Health Check

Last Commit: 2 weeks ago
Responsiveness: 1 day
Star History: 12 stars in the last 30 days