cleanlab by cleanlab

Data-centric AI package for ML with messy data

created 7 years ago
10,753 stars

Top 4.8% on sourcepulse

Project Summary

Cleanlab is a data-centric AI package that automatically detects and helps fix issues in machine learning datasets, particularly messy, real-world data and labels. It uses existing models to identify problems such as outliers, duplicates, and label errors, which helps improve model reliability across ML tasks including supervised learning, LLMs, and RAG applications.

How It Works

Cleanlab uses confident learning algorithms, grounded in peer-reviewed research, to estimate which examples in a dataset are problematic. It takes an existing ML model's predictions and embeddings as input and uses them to diagnose issues in the data. This approach requires no changes to existing modeling code and applies to any dataset type (text, image, audio, tabular) and any ML model (PyTorch, TensorFlow, XGBoost, etc.).
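
The thresholding idea behind confident learning can be illustrated with a small, dependency-free sketch. This is a simplified illustration, not cleanlab's implementation: the function names `class_thresholds` and `find_likely_label_errors` are hypothetical, and the toy `labels` and `pred_probs` are made up for the example. Each class gets a threshold equal to the model's average predicted probability for that class over the examples labeled with it; an example is flagged when the class the model confidently predicts differs from its given label.

```python
from collections import defaultdict


def class_thresholds(labels, pred_probs):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    return {k: sums[k] / counts[k] for k in counts}


def find_likely_label_errors(labels, pred_probs):
    """Return indices whose given label disagrees with the class
    the model predicts at or above that class's threshold."""
    thresholds = class_thresholds(labels, pred_probs)
    flagged = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes predicted with probability at or above their own threshold.
        confident = [k for k in thresholds if probs[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class here
        best = max(confident, key=lambda k: probs[k])
        if best != y:
            flagged.append(i)
    return flagged


# Toy data (hypothetical): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model is confident it is class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

print(find_likely_label_errors(labels, pred_probs))  # → [2]
```

Per-class thresholds, rather than a single global cutoff, let the check adapt to classes the model is systematically more or less confident about.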

Quick Start & Requirements

  • Install: pip install cleanlab or conda install cleanlab
  • Prerequisites: Python 3.8+
  • Resources: No specific hardware requirements mentioned, but performance may depend on dataset size and model complexity.
  • Links: Documentation, Examples
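
After installation, a first call might look like the sketch below. It wraps `find_label_issues` from `cleanlab.filter` in a helper named `rank_label_issues`, which is a hypothetical name chosen for this example. The sketch assumes `labels` are integer class indices from 0 to K-1 and `pred_probs` is an N-by-K array of predicted probabilities from a model you have already trained, ideally out-of-sample (for example, from cross-validation). Check the official documentation for the current signature before relying on it.

```python
def rank_label_issues(labels, pred_probs):
    """Return indices of likely mislabeled examples, most suspicious first.

    labels: integer class labels in 0..K-1, one per example
    pred_probs: shape (N, K) predicted class probabilities from an
        existing model (assumed to be out-of-sample)
    """
    # Imported lazily so this module still loads where cleanlab is absent.
    import numpy as np
    from cleanlab.filter import find_label_issues

    return find_label_issues(
        labels=np.asarray(labels),
        pred_probs=np.asarray(pred_probs),
        # Rank flagged examples by the model's confidence in the given label.
        return_indices_ranked_by="self_confidence",
    )
```

The returned indices can then be reviewed by hand or used to filter the training set before retraining.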

Highlighted Details

  • Detects a wide range of data issues, including outliers, duplicates, and label errors; also assesses multi-annotator quality and supports active learning.
  • Model-agnostic: Works with any ML model and dataset type.
  • Theoretically backed with provable guarantees for label noise estimation.
  • Offers dedicated functionality for various ML tasks including classification, token classification, regression, image segmentation, and object detection.
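
To make the role of embeddings in duplicate and outlier detection concrete, here is a minimal sketch based on nearest-neighbor distances. It shows one common approach and is not necessarily how cleanlab implements these checks. The function names, the two thresholds, and the toy `embeddings` are all assumptions made for the illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_neighbor_distances(embeddings):
    """For each embedding, return (distance, index) of its nearest other embedding."""
    return [
        min((cosine_distance(a, b), j) for j, b in enumerate(embeddings) if j != i)
        for i, a in enumerate(embeddings)
    ]


def flag_duplicates_and_outliers(embeddings, dup_threshold=0.01, outlier_threshold=0.3):
    """Flag near-duplicate pairs (very close nearest neighbor) and
    outliers (nearest neighbor is still far away)."""
    nn = nearest_neighbor_distances(embeddings)
    duplicates = sorted(
        {tuple(sorted((i, j))) for i, (d, j) in enumerate(nn) if d <= dup_threshold}
    )
    outliers = [i for i, (d, _) in enumerate(nn) if d >= outlier_threshold]
    return duplicates, outliers


# Toy 2-D embeddings (hypothetical); real ones come from an existing model.
embeddings = [
    [1.0, 0.0],
    [1.0, 0.0],  # identical to index 0
    [0.8, 0.6],
    [0.0, 1.0],  # far from every other point
]

duplicates, outliers = flag_duplicates_and_outliers(embeddings)
print(duplicates)  # → [(0, 1)]
print(outliers)    # → [3]
```

The thresholds here are arbitrary; in practice they would be tuned to the embedding model and dataset, and the brute-force pairwise search would be replaced by an approximate nearest-neighbor index for large datasets.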

Maintenance & Community

  • Active community with a Slack channel for discussion and contributions.
  • Regularly publishes research and blog posts on data-centric AI.

Licensing & Compatibility

  • Licensed under GNU Affero General Public License v3 or later.
  • Commercial licensing is available upon request.

Limitations & Caveats

  • The open-source version requires users to supply their own ML model and their own interface for fixing the issues it identifies.
  • Issue detection is only as effective as the initial ML model whose predictions it relies on.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 1
  • Star history: 284 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Andre Zayarni (Cofounder of Qdrant), and 1 more.

refinery by code-kern-ai

0.1%
1k
Open-source tool for NLP data scaling, assessment, and maintenance
created 3 years ago
updated 7 months ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

0.4%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 5 days ago