awesome-open-data-annotation by zenml-io

Curated list of open-source data annotation/labeling tools

Created 3 years ago
641 stars

Top 51.8% on SourcePulse

Project Summary

This repository is a curated list of open-source data annotation and labeling tools, categorized by data modality (text, images, audio, video, time series, multi-modal). It aims to help machine learning practitioners discover and evaluate tools that fit their MLOps workflows, particularly for data-centric approaches.

How It Works

The project functions as a community-driven directory, compiling tools based on three core criteria: open-source license, active maintenance, and fitness for purpose. It provides a structured overview of available tools, facilitating discovery and comparison for users involved in data annotation and labeling.

Quick Start & Requirements

This is a curated list, not a software package. To use the tools, refer to their individual project pages.

Highlighted Details

  • Comprehensive coverage across multiple data types including text, images, audio, video, time series, and multi-modal data.
  • Tools range from simple Jupyter notebook widgets to full-fledged web platforms.
  • Licenses vary, including permissive (MIT, Apache-2, BSD) and copyleft (GPL, AGPL) options.
  • Includes tools with AI-assisted labeling capabilities.

Maintenance & Community

ZenML maintains the list and accepts community contributions through pull requests. Users are encouraged to join the ZenML Slack to discuss MLOps integrations and possible collaborations.

Licensing & Compatibility

The repository itself is not licensed as software. The listed tools carry a range of licenses: Apache-2, MIT, BSD, GPL-3, AGPL-3, and ELv2, with some entries marked "Custom" or "Unknown". Whether a tool can be used commercially depends on that tool's own license.

Limitations & Caveats

The list's quality and completeness depend on community contributions. Some tools are listed with an "Unknown" or "N/A" license, and how actively each tool is maintained varies. Each entry's "Description" field is brief, so users need to visit the individual project pages for detailed functionality.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days
