docta  by Docta-ai

Data-centric AI platform for detecting and rectifying data issues

Created 2 years ago
3,465 stars

Top 14.0% on SourcePulse

Project Summary

Docta is a data-centric AI platform designed to detect and rectify issues in datasets, aiming to improve model performance. It supports tabular, text, and image data, offering automated diagnosis, curation, and "nutrition" services. The open-source tool is training-free and targets users seeking to enhance data quality without extensive manual effort or computational cost.

How It Works

Docta employs a "training-free" approach to data diagnosis and correction. For LLM alignment data, it extracts embeddings from conversations and uses these representations to identify potentially mislabeled or harmful content, assigning each flagged instance a suggest_rating. This method aims to efficiently surface annotation errors and problematic patterns without requiring model retraining.
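
The idea of flagging labels from embeddings alone can be illustrated with a small sketch. It uses nearest-neighbor voting over toy 2-D vectors, which is one common training-free technique; the summary above does not confirm that Docta works this way. The function names (cosine, knn_suggest_labels), the toy data, and the choice of k are all assumptions made for illustration and are not part of the docta package.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_suggest_labels(embeddings, labels, k=3):
    """Hypothetical helper: for each item, return the majority label
    among its k most similar neighbors (the item itself is excluded)."""
    suggestions = []
    for i, e in enumerate(embeddings):
        # Rank every other item by similarity to item i, keep the top k.
        neighbors = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(e, embeddings[j]),
            reverse=True,
        )[:k]
        votes = Counter(labels[j] for j in neighbors)
        suggestions.append(votes.most_common(1)[0][0])
    return suggestions


# Toy stand-ins for conversation embeddings: two well-separated clusters.
embeddings = [
    (1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.95, 0.15),
    (0.1, 1.0), (0.2, 0.9), (0.0, 1.0), (0.15, 0.95),
]
# Item 3 sits in the first cluster but carries the second cluster's label.
labels = [0, 0, 0, 1, 1, 1, 1, 1]

suggested = knn_suggest_labels(embeddings, labels, k=3)
flagged = [i for i, (y, s) in enumerate(zip(labels, suggested)) if y != s]
print(flagged)
```

No model is trained here: the only inputs are precomputed vectors and the existing labels. In practice the embeddings would come from a pretrained encoder, which is where a GPU helps.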

Quick Start & Requirements

  • Install: pip install docta.ai
  • Requirements: A Python environment; a GPU is strongly recommended for feature encoding.
  • Setup: Download datasets to data_root as specified in configuration.
  • Demos: Jupyter notebooks available in ./demo/ for LLM alignment data (red teaming, harmlessness), image data, and tabular data.
  • Docs: https://github.com/Docta-ai/docta

Highlighted Details

  • Detects label errors in LLM alignment data, reporting noise rates (e.g., ~8% in red teaming attempts, ~28% in harmless base comparisons).
  • Supports diagnosis of image data (CIFAR-N) and tabular data for label errors and rare patterns.
  • Offers "training-free" detection of rare patterns in image datasets.
  • Provides a mechanism to "cure" data by adding suggest_rating to instances.
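
As a rough illustration of the "cure" step and the reported noise rates, the sketch below attaches a suggest_rating field to each record and measures how often it disagrees with the original label. Only the field name suggest_rating comes from the summary above; the record layout, the "rating" and "text" keys, and the functions cure and noise_rate are hypothetical and do not reflect the docta API.

```python
def cure(records, suggestions):
    """Hypothetical helper: return copies of the records with a
    suggest_rating field added, leaving the originals untouched."""
    cured = []
    for record, suggestion in zip(records, suggestions):
        updated = dict(record)
        updated["suggest_rating"] = suggestion
        cured.append(updated)
    return cured


def noise_rate(cured, label_key="rating"):
    """Fraction of records whose original label differs from suggest_rating."""
    if not cured:
        return 0.0
    changed = sum(1 for r in cured if r[label_key] != r["suggest_rating"])
    return changed / len(cured)


# Made-up records; "rating" is an assumed name for the original annotation.
records = [
    {"text": "sample A", "rating": 0},
    {"text": "sample B", "rating": 1},
    {"text": "sample C", "rating": 0},
    {"text": "sample D", "rating": 1},
]
suggestions = [0, 1, 1, 1]  # e.g. produced by a detector; C is disputed

cured = cure(records, suggestions)
print(noise_rate(cured))
```

Keeping the original rating alongside suggest_rating, rather than overwriting it, lets a reviewer audit each proposed change before accepting it.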

Maintenance & Community

  • Contact: contact@docta.ai for commercial inquiries and full version requests.
  • Citation: Available via arXiv preprint.

Licensing & Compatibility

  • License: Creative Commons Attribution-NonCommercial 4.0.
  • Commercial Use: Requires contacting contact@docta.ai.

Limitations & Caveats

The open-source version provides only demos and sampled results; the full version requires contacting the developers. The "cured" data is released as sampled results and should be used at the user's discretion.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 49 stars in the last 30 days
