docta by Docta-ai

Data-centric AI platform for detecting/rectifying data issues

Created 2 years ago

3,489 stars

Top 13.8% on SourcePulse

Project Summary

Docta is a data-centric AI platform designed to detect and rectify issues in datasets, aiming to improve model performance. It supports tabular, text, and image data, offering automated diagnosis, curation, and "nutrition" services. The open-source tool is training-free and targets users seeking to enhance data quality without extensive manual effort or computational cost.

How It Works

Docta employs a "training-free" approach to data diagnosis and correction. For LLM alignment data, it extracts embeddings from conversations and uses these representations to identify potentially mislabeled or harmful content, assigning a suggest_rating. This method aims to efficiently surface annotation errors and problematic patterns without requiring model retraining.
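To make the "training-free" idea concrete, here is a minimal sketch of one common way to flag suspicious labels from embeddings alone: a majority vote among each sample's nearest neighbours. It is an illustration under stated assumptions, not Docta's actual detector or API. The function name suggest_labels, the parameter k, and the use of cosine similarity are all hypothetical choices made for this example.

```python
import numpy as np

def suggest_labels(embeddings, labels, k=3):
    """Hypothetical helper (not part of Docta): k-nearest-neighbour label vote.

    embeddings: (n, d) array of precomputed feature vectors (non-zero rows).
    labels:     (n,) array of integer class ids starting at 0.
    Returns (suggested, flagged), where flagged marks samples whose given
    label disagrees with the majority label of their k nearest neighbours.
    """
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # Normalise rows so the dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T

    # A sample must not vote for itself.
    np.fill_diagonal(sim, -np.inf)

    # Indices of the k most similar other samples for each row.
    neighbours = np.argsort(-sim, axis=1)[:, :k]

    # Count neighbour votes per class; ties resolve to the lowest class id.
    num_classes = int(y.max()) + 1
    votes = np.zeros((len(y), num_classes), dtype=int)
    for c in range(num_classes):
        votes[:, c] = (y[neighbours] == c).sum(axis=1)

    suggested = votes.argmax(axis=1)
    return suggested, suggested != y
```

No model is trained at any point: the only inputs are embeddings from an existing encoder and the labels under review, which is why this style of check stays cheap compared with retraining.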

Quick Start & Requirements

Install: pip install docta.ai
Requirements: A Python environment; a GPU is strongly recommended for feature encoding.
Setup: Download datasets to data_root as specified in configuration.
Demos: Jupyter notebooks available in ./demo/ for LLM alignment data (red teaming, harmlessness), image data, and tabular data.
Docs: https://github.com/Docta-ai/docta

Highlighted Details

Detects label errors in LLM alignment data, reporting noise rates (e.g., ~8% in red teaming attempts, ~28% in harmless base comparisons).
Supports diagnosis of image data (CIFAR-N) and tabular data for label errors and rare patterns.
Offers "training-free" detection of rare patterns in image datasets.
Provides a mechanism to "cure" data by adding suggest_rating to instances.
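The "cure" step and the reported noise rates can be pictured with a short sketch. It assumes each instance is a plain dict and that the original label lives under a key named rating. Both are assumptions for illustration, as is the helper name cure. Only the suggest_rating field name comes from the project description. The noise rate here is simply the fraction of instances whose suggestion differs from the original label, which may not match how Docta estimates it.

```python
def cure(records, suggested, label_key="rating"):
    """Hypothetical helper (not Docta's API): attach suggest_rating to each record.

    records:   list of dicts, each holding the original label under label_key.
    suggested: list of suggested labels, aligned with records.
    Returns (cured_records, noise_rate). The input records are not modified.
    """
    cured = []
    for record, suggestion in zip(records, suggested):
        updated = dict(record)                    # shallow copy, original stays intact
        updated["suggest_rating"] = suggestion    # field name taken from the description
        cured.append(updated)

    # Fraction of instances where the suggestion disagrees with the given label.
    disagreements = sum(1 for r in cured if r[label_key] != r["suggest_rating"])
    noise_rate = disagreements / len(cured) if cured else 0.0
    return cured, noise_rate
```

Keeping the original label next to suggest_rating, rather than overwriting it, lets a reviewer audit each suggested change before accepting it.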

Maintenance & Community

Contact: contact@docta.ai for commercial inquiries and full version requests.
Citation: Available via arXiv preprint.

Licensing & Compatibility

License: Creative Commons Attribution-NonCommercial 4.0.
Commercial Use: Requires contacting contact@docta.ai.

Limitations & Caveats

The open-source version provides only demos and sampled results; the full version is available by contacting the developers. Because the effectiveness of "cured" data is shown only through sampled results, the suggested corrections should be applied at the user's discretion.
