Curated list of open-source tools for data-centric AI on unstructured data
This repository is a curated list of open-source tools for Data-Centric AI (DCAI) workflows on unstructured data. It is aimed at ML engineers and data scientists, and helps them discover and use tools for systematically engineering datasets in order to build more robust and valuable AI systems.
How It Works
The list categorizes tools across key DCAI stages: Data Versioning, Embeddings & Pre-trained Models, Visualization & Interaction, Outlier/Noise Detection, Explainability, Active Learning, Uncertainty Quantification, Bias & Fairness, Observability & Monitoring, Augmentation & Synthetic Data, and Security & Robustness. It also includes a "Data-centric AI playbook" with workflow snippets demonstrating how to solve common tasks using these tools.
Quick Start & Requirements
This is a curated list, not a runnable project. To use a tool, refer to its own project page. The README links to each tool and includes a "Further reading" section covering related topics such as DCAI for tabular data, labeling tools, MLOps, and research papers.
Maintenance & Community
The list is maintained by Renumics and encourages community contributions via pull requests or direct contact. Links to external "awesome lists" for related topics are provided.
Licensing & Compatibility
The README lists various open-source licenses for the tools included (e.g., MIT, Apache 2.0). Specific licensing details and compatibility for commercial use must be checked for each individual tool.
Limitations & Caveats
This is a curated list and not a unified framework; users must integrate and manage individual tools. Popularity metrics and specific license details for each tool are not consistently provided within the list itself.