awesome-open-data-centric-ai by Renumics

Curated list of open-source tools for data-centric AI on unstructured data

created 2 years ago
721 stars

Top 48.7% on sourcepulse

Project Summary

This repository is a curated list of open-source tooling for Data-Centric AI (DCAI) workflows, focusing specifically on unstructured data. It is aimed at ML engineers and data scientists, and helps them discover and use tools for systematically engineering datasets in order to build more robust and valuable AI systems.

How It Works

The list categorizes tools across key DCAI stages: Data Versioning, Embeddings & Pre-trained Models, Visualization & Interaction, Outlier/Noise Detection, Explainability, Active Learning, Uncertainty Quantification, Bias & Fairness, Observability & Monitoring, Augmentation & Synthetic Data, and Security & Robustness. It also includes a "Data-centric AI playbook" with workflow snippets demonstrating how to solve common tasks using these tools.

Quick Start & Requirements

This is a curated list, not a runnable project. To use the tools, refer to their individual project pages. The README provides links to the tools themselves and a "Further reading" section for related topics like tabular data DCAI, labeling tools, MLOps, and research papers.

Highlighted Details

  • Covers a broad spectrum of DCAI tasks, from data versioning (DVC, Deep Lake) to model explainability (SHAP, LIME) and security (CleverHans, LLM-Guard).
  • Includes specific tools for unstructured data types like images, audio, video, time-series, and text.
  • Features a "Data-centric AI playbook" with practical workflow examples and notebook links for common tasks like outlier detection and label inconsistency identification.
  • Excludes tabular data tools, dedicated labeling tools, and MLOps tooling to maintain focus on unstructured data DCAI.
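As a rough illustration of the kind of outlier-detection task the playbook addresses, the sketch below scores toy embedding vectors by their mean distance to their nearest neighbours. It is not taken from the repository or from any listed tool: the function name `knn_outlier_scores`, the parameter `k`, and the sample data are hypothetical, and only the Python standard library is used.

```python
import math


def knn_outlier_scores(embeddings, k=2):
    """Score each vector by its mean distance to its k nearest neighbours.

    Hypothetical helper for illustration; higher scores mean the point
    sits farther from the rest of the data.
    """
    scores = []
    for i, a in enumerate(embeddings):
        # Distances from point i to every other point, smallest first.
        dists = sorted(
            math.dist(a, b) for j, b in enumerate(embeddings) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy 2-D "embeddings" (made-up data): four clustered points, one far away.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(embeddings, k=2)

# Index of the point with the highest outlier score.
most_suspicious = max(range(len(scores)), key=scores.__getitem__)
```

In practice the embeddings would typically come from a pre-trained model, and a dedicated library would handle larger datasets; the brute-force loop here is quadratic in the number of points and is meant only for intuition.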

Maintenance & Community

The list is maintained by Renumics and encourages community contributions via pull requests or direct contact. Links to external "awesome lists" for related topics are provided.

Licensing & Compatibility

The README lists various open-source licenses for the tools included (e.g., MIT, Apache 2.0). Specific licensing details and compatibility for commercial use must be checked for each individual tool.

Limitations & Caveats

This is a curated list and not a unified framework; users must integrate and manage individual tools. Popularity metrics and specific license details for each tool are not consistently provided within the list itself.

Health Check
Last commit: 1 year ago
Responsiveness: 1 week
Pull Requests (30d): 0
Issues (30d): 0
Star History
6 stars in the last 90 days
