croissant  by mlcommons

Metadata format for ML datasets (research paper)

created 2 years ago
679 stars

Top 50.9% on sourcepulse

Project Summary

Croissant is a high-level metadata format for machine learning datasets that standardizes how they are described and accessed. Built on schema.org, it combines metadata, resource descriptions, data structure, and ML semantics in a single web-friendly file to simplify dataset discovery, use, and tool support.

How It Works

Croissant defines a JSON-LD structure that encapsulates four key layers: Metadata (dataset description, responsible AI aspects), Resources (raw data file locations and properties), Structure (how data is organized and fields are defined), and ML Semantics (common ML usage patterns). This layered approach, grounded in schema.org, allows for rich, machine-readable descriptions that enhance discoverability and interoperability across ML tools and platforms.
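
As a rough illustration of the four layers, the sketch below builds a toy Croissant-style JSON-LD document in Python and groups its contents by layer. The dataset name, file name, and URL are made up, the `@context` is heavily abbreviated, and the use of `cr:Label` to mark a label field is an assumption to check against the Croissant specification. The helper names `build_minimal_croissant` and `summarize_layers` are hypothetical and are not part of any Croissant tooling.

```python
import json


def build_minimal_croissant() -> dict:
    """Return a toy, Croissant-style JSON-LD document (illustrative, not spec-validated)."""
    return {
        # Abbreviated context: the real Croissant context defines many more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "recordSet": "cr:recordSet",
            "field": "cr:field",
            "dataType": "cr:dataType",
            "source": "cr:source",
        },
        # Layer 1: Metadata (schema.org Dataset properties).
        "@type": "sc:Dataset",
        "name": "toy_reviews",  # hypothetical dataset
        "description": "A made-up sentiment dataset used only for illustration.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        # Layer 2: Resources (where the raw files live).
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "contentUrl": "https://example.org/reviews.csv",  # placeholder URL
                "encodingFormat": "text/csv",
            }
        ],
        # Layer 3: Structure (record sets and their fields).
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "text"},
                        },
                    },
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/label",
                        # Layer 4: ML semantics (assumed: cr:Label marks a label field).
                        "dataType": ["sc:Text", "cr:Label"],
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "label"},
                        },
                    },
                ],
            }
        ],
    }


def _as_list(value) -> list:
    """Normalize a JSON-LD value that may be absent, a scalar, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize_layers(doc: dict) -> dict:
    """Group the parts of a Croissant-style document by the layer they belong to."""
    fields = [f for rs in doc.get("recordSet", []) for f in rs.get("field", [])]
    return {
        "metadata": {"name": doc.get("name"), "license": doc.get("license")},
        "resources": [f["@id"] for f in doc.get("distribution", [])],
        "structure": [f["@id"] for f in fields],
        "ml_semantics": [
            f["@id"] for f in fields if "cr:Label" in _as_list(f.get("dataType"))
        ],
    }


if __name__ == "__main__":
    print(json.dumps(summarize_layers(build_minimal_croissant()), indent=2))
```

A real Croissant file carries more required properties than this sketch, so it should be treated as a reading aid for the layer structure rather than a template.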

Quick Start & Requirements

  • Installation: pip install mlcroissant
  • Requirements: Python 3.10+
  • Example Usage: Load and inspect a dataset with the mlcroissant library, or integrate with tensorflow_datasets using CroissantBuilder.
  • Documentation: https://huggingface.co/docs/croissant/index
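
The loading step might look like the minimal sketch below. It assumes that `mlcroissant` exposes `Dataset(jsonld=...)`, `metadata.name`, and `records(record_set=...)` as in the project's README at the time of writing. The caller supplies the Croissant file location and the record set name. The helpers `first_records` and `preview_croissant` are hypothetical names introduced here for illustration.

```python
from itertools import islice
from typing import Any, Iterable


def first_records(records: Iterable[Any], limit: int = 3) -> list[Any]:
    """Materialize at most `limit` items from a possibly very large iterable."""
    return list(islice(records, limit))


def preview_croissant(jsonld: str, record_set: str, limit: int = 3) -> list[Any]:
    """Print a dataset's name and return its first few records.

    `jsonld` is a path or URL to a Croissant JSON-LD file, and `record_set`
    is the name of a record set declared in that file.
    """
    # Imported lazily so this module can be imported without mlcroissant installed.
    import mlcroissant as mlc

    dataset = mlc.Dataset(jsonld=jsonld)  # assumed constructor signature
    print(dataset.metadata.name)  # dataset-level metadata
    return first_records(dataset.records(record_set=record_set), limit)
```

Capping the preview with `islice` avoids reading an entire dataset just to inspect its shape, which matters when records are streamed from remote files.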

Highlighted Details

  • Built on schema.org's Dataset vocabulary for web discoverability.
  • Integrations with Hugging Face, Kaggle, OpenML, and TensorFlow Datasets (TFDS).
  • Supports responsible AI metadata.
  • Community-driven development under MLCommons.

Maintenance & Community

Croissant is a community-driven project under the MLCommons Association Datasets Working Group. Key contributors include individuals from Hugging Face, Google, Meta, Kaggle, and academic institutions. Community engagement is encouraged via mailing lists and issue tracking.

Licensing & Compatibility

The Croissant project code and examples are licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Croissant is under active development. The core format is stable, but specific tool integrations and advanced features may still be evolving. The README does not give performance benchmarks or known limitations beyond this development status.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 22
  • Issues (30d): 8
  • Star history: 92 stars in the last 90 days
