datasets by huggingface

Access and process large AI datasets efficiently

Created 5 years ago
20,733 stars

Top 2.1% on SourcePulse

Project Summary

🤗 Datasets is a Python library providing easy access to a vast collection of AI datasets and efficient data manipulation tools. It targets ML practitioners and researchers, enabling quick loading, preprocessing, and integration of diverse datasets (text, image, audio) with popular ML frameworks.

How It Works

The library stores data in Apache Arrow format and memory-maps it from disk, so it can handle datasets larger than RAM. Processed results are cached and reused on later runs, and a streaming mode lets you start iterating immediately while using less disk space. Data can be returned as NumPy, Pandas, PyTorch, TensorFlow, or JAX objects.
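As a rough illustration of the workflow described above, the sketch below builds a small in-memory dataset, adds a column with a batched `map`, and asks for NumPy output. The helper names `add_length` and `demo` and the toy data are assumptions made for this example; `Dataset.from_dict`, `map`, and `with_format` are the library calls being shown. It assumes `datasets` and NumPy are installed.

```python
def add_length(batch):
    """Batched map function: add a character count for each text."""
    return {"n_chars": [len(text) for text in batch["text"]]}


def demo():
    # Imported here so add_length stays usable without the library installed.
    from datasets import Dataset

    # Toy in-memory data (hypothetical); real datasets are usually
    # Arrow files on disk that the library memory-maps.
    ds = Dataset.from_dict({"text": ["hello", "arrow tables"], "label": [0, 1]})

    # map() calls add_length on batches of rows and adds the returned column.
    ds = ds.map(add_length, batched=True)

    # with_format() changes how rows are returned, not how they are stored.
    return ds.with_format("numpy")
```

Calling `demo()` should return a dataset with an extra `n_chars` column whose rows come back as NumPy values. The same `with_format` call accepts other format names, such as `"torch"` or `"pandas"`, when those frameworks are installed.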

Quick Start & Requirements

Highlighted Details

  • Access to over 650 datasets on the Hugging Face Hub.
  • Supports 467 languages and dialects.
  • Native support for audio, image, and video data.
  • Memory-mapping via Apache Arrow for large datasets.
  • Streaming mode for disk space saving and immediate iteration.
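The streaming mode in the list above can be sketched as follows. The helpers `take` and `stream_head` are hypothetical names introduced for this example, and `repo_id` stands for whatever Hub dataset you want to read; the only library call shown is `load_dataset(..., streaming=True)`. Network access and an installed `datasets` package are assumed.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def stream_head(repo_id, split="train", n=3):
    # Imported here so take() stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches examples
    # lazily, so iteration can start before the full dataset is on disk.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return take(stream, n)
```

Because nothing is materialized up front, this pattern suits a quick look at a large dataset. The trade-off is that a streamed dataset is iterated in order and does not offer random access by index.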

Maintenance & Community

The library has over 250 contributors and is actively maintained by the Hugging Face team. Community discussions and dataset sharing are encouraged via the Hugging Face Hub.

Licensing & Compatibility

The library is distributed under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Dataset reproducibility relies on users pinning specific repository revisions. While the library aims for broad compatibility, specific dataset preprocessing scripts might require additional dependencies.
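One way to pin a revision, sketched under the assumption that the dataset lives in a Hub repository: pass a commit hash or tag through the `revision` argument of `load_dataset`. The wrapper name `load_pinned` and its empty-revision check are illustrative choices for this example and are not part of the library.

```python
def load_pinned(repo_id, revision, split="train"):
    """Load a Hub dataset at a fixed revision (commit hash, tag, or branch)."""
    if not revision:
        # Refuse to fall back silently to the repository's latest state.
        raise ValueError("pass an explicit revision, e.g. a commit hash or tag")

    # Imported here so the argument check runs without the library installed.
    from datasets import load_dataset

    # revision selects which version of the dataset repository is loaded.
    return load_dataset(repo_id, revision=revision, split=split)
```

A full commit hash is the strictest choice, since branches move and tags can be reassigned.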

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 34
  • Issues (30d): 16

Star History

117 stars in the last 30 days
