datasets by huggingface

Access and process large AI datasets efficiently

Created 6 years ago

21,272 stars

Top 2.2% on SourcePulse

28 Experts Love This Project

Clement Delangue

Cofounder of Hugging Face

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Travis Fischer

Founder of Agentic

Georgios Konstantopoulos

CTO, General Partner at Paradigm

and 24 more!

Project Summary

🤗 Datasets is a Python library providing easy access to a vast collection of AI datasets and efficient data manipulation tools. It targets ML practitioners and researchers, enabling quick loading, preprocessing, and integration of diverse datasets (text, image, audio) with popular ML frameworks.

How It Works

The library stores data in the Apache Arrow format and memory-maps the Arrow files from disk, so it can handle datasets larger than RAM. It caches processed data so that repeated transformations are reused rather than recomputed, and it supports streaming for immediate iteration and reduced disk usage. Its API integrates with NumPy, Pandas, PyTorch, TensorFlow, and JAX.

Quick Start & Requirements

Install with pip: pip install datasets
For ML framework integration, install PyTorch (2.0+), TensorFlow (2.6+), or JAX (0.3.14+).
Documentation: https://huggingface.co/docs/datasets/installation
Quickstart guide: https://huggingface.co/docs/datasets/quickstart

Highlighted Details

Access to over 650 datasets on the Hugging Face Hub.
Supports 467 languages and dialects.
Native support for audio, image, and video data.
Memory-mapping via Apache Arrow for large datasets.
Streaming mode for disk space saving and immediate iteration.

Maintenance & Community

The library has over 250 contributors and is actively maintained by the Hugging Face team. Community discussions and dataset sharing are encouraged via the Hugging Face Hub.

Licensing & Compatibility

The library is distributed under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Dataset reproducibility relies on users pinning specific repository revisions. While the library aims for broad compatibility, specific dataset preprocessing scripts might require additional dependencies.

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Star History

113 stars in the last 30 days