Data-centric AI platform for detecting/rectifying data issues
Docta is a data-centric AI platform designed to detect and rectify issues in datasets, aiming to improve model performance. It supports tabular, text, and image data, offering automated diagnosis, curation, and "nutrition" services. The open-source tool is training-free and targets users seeking to enhance data quality without extensive manual effort or computational cost.
How It Works
Docta employs a "training-free" approach to data diagnosis and correction. For LLM alignment data, it extracts embeddings from conversations and uses these representations to identify potentially mislabeled or harmful content, assigning a suggest_rating. This method aims to efficiently surface annotation errors and problematic patterns without requiring model retraining.
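To make the idea concrete, here is a minimal sketch of one training-free way to flag suspicious labels from embeddings: each instance takes a majority vote over the labels of its nearest neighbors in embedding space, and an instance is flagged when the vote disagrees with its given rating. This is an illustration under stated assumptions, not Docta's actual algorithm or API. The function name suggest_ratings, the k-nearest-neighbor vote, and the toy data are all hypothetical; only the output name suggest_rating is borrowed from the description above.

```python
import numpy as np

def suggest_ratings(embeddings, ratings, k=3):
    """Hypothetical helper (not part of Docta): k-NN majority vote.

    Assumes every embedding is a nonzero vector and ratings are
    discrete labels. Returns (suggest_rating, flagged) arrays.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(ratings)
    # Normalize rows so that dot products are cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    # Exclude each instance from its own neighbor list.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k most similar other instances, per row.
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    suggest_rating = np.empty_like(y)
    for i, idx in enumerate(neighbors):
        values, counts = np.unique(y[idx], return_counts=True)
        suggest_rating[i] = values[np.argmax(counts)]
    # Flag instances whose given rating disagrees with the neighbor vote.
    flagged = suggest_rating != y
    return suggest_rating, flagged

# Toy data: two tight clusters; the instance at index 3 sits in the
# first cluster but carries the rating of the second.
embeddings = [
    [1.00, 0.00], [0.90, 0.10], [0.95, 0.05], [0.98, 0.02],
    [0.00, 1.00], [0.10, 0.90], [0.05, 0.95], [0.02, 0.98],
]
ratings = [0, 0, 0, 1, 1, 1, 1, 1]

suggested, flagged = suggest_ratings(embeddings, ratings, k=3)
print(np.flatnonzero(flagged))  # indices whose rating looks inconsistent
```

No model is trained here: the only inputs are precomputed embeddings and the existing ratings, which is what keeps this kind of check cheap. A real pipeline would take its embeddings from a pretrained encoder and would likely use a more robust consistency estimate than a plain majority vote.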
Quick Start & Requirements
pip install docta.ai
Set data_root as specified in the configuration.
Demos are provided under ./demo/ for LLM alignment data (red teaming, harmlessness), image data, and tabular data.
Highlighted Details
Assigns a suggest_rating to instances.
Maintenance & Community
Contact contact@docta.ai for commercial inquiries and full version requests.
Licensing & Compatibility
Contact: contact@docta.ai.
Limitations & Caveats
The open-source version provides demos and sampled results; a full version requires contacting the developers. The effectiveness of "cured" data is presented as sampled results and should be used at the user's discretion.