docta  by Docta-ai

Data-centric AI platform for detecting/rectifying data issues

created 2 years ago
3,397 stars

Top 14.6% on sourcepulse

Project Summary

Docta is a data-centric AI platform that detects and rectifies issues in datasets in order to improve model performance. It supports tabular, text, and image data and offers automated diagnosis, curation, and "nutrition" services. The open-source tool is training-free, so it suits users who want to improve data quality without extensive manual effort or computational cost.

How It Works

Docta employs a "training-free" approach to data diagnosis and correction. For LLM alignment data, it extracts embeddings from conversations, uses these representations to identify potentially mislabeled or harmful content, and assigns a suggest_rating to the instances it examines. This method aims to surface annotation errors and problematic patterns efficiently without requiring model retraining.
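To make the idea of training-free detection on embeddings concrete, here is a minimal sketch. It is not Docta's actual algorithm or API: it swaps in a plain cosine k-nearest-neighbour majority vote, and the function name knn_suggested_labels, the parameter k, and the toy data are all assumptions made for illustration. An instance is flagged when its own label disagrees with the vote of its nearest neighbours in embedding space.

```python
import numpy as np


def knn_suggested_labels(embeddings, labels, k=5):
    """Hypothetical helper: suggest a label per instance by cosine k-NN majority vote.

    Returns (suggested, flagged), where flagged[i] is True when the
    suggestion differs from the original label. No model is trained.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=int)

    # Normalise rows so that the dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)

    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)  # an instance must not vote for itself

    # Indices of the k most similar instances for every row.
    neighbours = np.argsort(-sim, axis=1)[:, :k]

    n_classes = int(y.max()) + 1
    suggested = np.empty_like(y)
    for i, idx in enumerate(neighbours):
        votes = np.bincount(y[idx], minlength=n_classes)
        suggested[i] = votes.argmax()

    return suggested, suggested != y


# Toy data (assumption): two well-separated clusters standing in for embeddings.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[1.0, 0.0], scale=0.05, size=(10, 2))
cluster_b = rng.normal(loc=[0.0, 1.0], scale=0.05, size=(10, 2))
embeddings = np.vstack([cluster_a, cluster_b])

labels = [0] * 10 + [1] * 10
labels[3] = 1  # inject one label error into the first cluster

suggested, flagged = knn_suggested_labels(embeddings, labels, k=5)
print("flagged indices:", np.flatnonzero(flagged).tolist())
```

In this toy setup only the injected error is flagged, because every other point agrees with the majority of its neighbours. Real embeddings are noisier, so the choice of k and of the similarity measure would need tuning.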

Quick Start & Requirements

  • Install: pip install docta.ai
  • Requirements: A Python environment; a GPU is strongly recommended for feature encoding.
  • Setup: Download datasets to data_root as specified in configuration.
  • Demos: Jupyter notebooks available in ./demo/ for LLM alignment data (red teaming, harmlessness), image data, and tabular data.
  • Docs: https://github.com/Docta-ai/docta

Highlighted Details

  • Detects label errors in LLM alignment data, reporting noise rates (e.g., ~8% in red teaming attempts, ~28% in harmless base comparisons).
  • Supports diagnosis of image data (CIFAR-N) and tabular data for label errors and rare patterns.
  • Offers "training-free" detection of rare patterns in image datasets.
  • Provides a mechanism to "cure" data by adding suggest_rating to instances.
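
The "cure" step and the noise rates mentioned in the list above can be illustrated with a short sketch. Only the field name suggest_rating comes from the summary; the record layout, the "rating" key, the helper cure_records, and the definition of noise rate as the fraction of instances whose suggestion differs from the original label are assumptions for illustration, not Docta's documented behavior.

```python
def cure_records(records, suggested_ratings, label_key="rating"):
    """Hypothetical helper: attach suggest_rating to copies of the records.

    Returns (cured, noise_rate). noise_rate is assumed here to be the share
    of records whose suggested rating differs from the original rating.
    """
    if len(records) != len(suggested_ratings):
        raise ValueError("records and suggested_ratings must have equal length")

    cured = []
    disagreements = 0
    for record, suggestion in zip(records, suggested_ratings):
        new_record = dict(record)  # copy, so the input data is left untouched
        new_record["suggest_rating"] = suggestion
        cured.append(new_record)
        if record[label_key] != suggestion:
            disagreements += 1

    noise_rate = disagreements / len(records) if records else 0.0
    return cured, noise_rate


# Made-up records (assumption): the texts and ratings are placeholders.
records = [
    {"text": "example A", "rating": 0},
    {"text": "example B", "rating": 1},
    {"text": "example C", "rating": 0},
    {"text": "example D", "rating": 2},
]
suggestions = [0, 1, 1, 2]

cured, noise_rate = cure_records(records, suggestions)
print(cured[2])
print(f"estimated noise rate: {noise_rate:.0%}")
```

Keeping the original rating next to suggest_rating, rather than overwriting it, lets a reviewer audit each disagreement before accepting the suggestion.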

Maintenance & Community

  • Contact: contact@docta.ai for commercial inquiries and full version requests.
  • Citation: Available via arXiv preprint.

Licensing & Compatibility

  • License: Creative Commons Attribution-NonCommercial 4.0.
  • Commercial Use: Requires contacting contact@docta.ai.

Limitations & Caveats

The open-source version provides demos and sampled results only; obtaining the full version requires contacting the developers. The "cured" data is released as sampled results, and users should apply it at their own discretion.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 158 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

0.4%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 5 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Travis Fischer (Founder of Agentic), and 5 more.

cleanlab by cleanlab

0.2%
11k
Data-centric AI package for ML with messy data
created 7 years ago
updated 3 weeks ago