datasets by huggingface

Access and process large AI datasets efficiently

created 5 years ago
20,507 stars

Top 2.1% on SourcePulse

Project Summary

🤗 Datasets is a Python library providing easy access to a vast collection of AI datasets and efficient data manipulation tools. It targets ML practitioners and researchers, enabling quick loading, preprocessing, and integration of diverse datasets (text, image, audio) with popular ML frameworks.

How It Works

The library stores data in Apache Arrow format and memory-maps it from disk, which lets it handle datasets larger than RAM. Processed data is cached for reuse, and a streaming mode allows immediate iteration with reduced disk usage. The API interoperates with NumPy, Pandas, PyTorch, TensorFlow, and JAX.

Quick Start & Requirements

Highlighted Details

  • Access to over 650 datasets on the Hugging Face Hub.
  • Supports 467 languages and dialects.
  • Native support for audio, image, and video data.
  • Memory-mapping via Apache Arrow for large datasets.
  • Streaming mode for disk space saving and immediate iteration.

Maintenance & Community

The library has over 250 contributors and is actively maintained by the Hugging Face team. Community discussions and dataset sharing are encouraged via the Hugging Face Hub.

Licensing & Compatibility

The library is distributed under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Dataset reproducibility relies on users pinning specific repository revisions. While the library aims for broad compatibility, specific dataset preprocessing scripts might require additional dependencies.

Health Check
Last commit: 3 days ago
Responsiveness: Inactive
Pull Requests (30d): 21
Issues (30d): 29
Star History
145 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

argilla by argilla-io

0.2%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anton Troynikov (Cofounder of Chroma), and 30 more.

llama_index by run-llama

0.3%
44k
Data framework for building LLM-powered agents
created 2 years ago
updated 22 hours ago