Open-source tool for NLP data scaling, assessment, and maintenance
Refinery is an open-source platform designed for data scientists to scale, assess, and maintain natural language processing (NLP) training data. It addresses the challenges of managing unstructured text data, enabling a data-centric approach to building better NLP models by semi-automating labeling, identifying low-quality data subsets, and monitoring data quality.
How It Works
Refinery employs a microservices architecture, integrating with libraries like Hugging Face Transformers and spaCy for NLP tasks and Qdrant for neural search. It supports a data-centric workflow by allowing users to define heuristics (e.g., Python functions, active learning models, zero-shot classifiers) to generate noisy labels. These heuristics, combined with manually labeled data, form a noisy label matrix used for analysis, quality assessment, and iterative model improvement.
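The idea of a noisy label matrix can be illustrated with a minimal sketch in plain Python. This is not Refinery's API: the heuristic functions, the label names, and the majority-vote aggregation are all hypothetical stand-ins chosen for illustration, and Refinery's own way of combining heuristic outputs may differ. Each heuristic either returns a label or abstains, and each row of the matrix holds the votes for one record.

```python
from collections import Counter
from typing import Callable, List, Optional

Label = Optional[str]  # None means the heuristic abstains


# Hypothetical keyword heuristics; names and labels are illustrative only.
def lf_contains_refund(text: str) -> Label:
    return "complaint" if "refund" in text.lower() else None


def lf_question_mark(text: str) -> Label:
    return "question" if text.strip().endswith("?") else None


def lf_thanks(text: str) -> Label:
    return "praise" if "thank" in text.lower() else None


HEURISTICS: List[Callable[[str], Label]] = [
    lf_contains_refund,
    lf_question_mark,
    lf_thanks,
]


def build_label_matrix(texts, heuristics) -> List[List[Label]]:
    """One row per record, one column per heuristic."""
    return [[h(t) for h in heuristics] for t in texts]


def majority_vote(row: List[Label]) -> Label:
    """Most frequent non-abstaining label; None on no votes or a tie."""
    votes = Counter(label for label in row if label is not None)
    if not votes:
        return None
    (top, top_n), *rest = votes.most_common(2)
    if rest and rest[0][1] == top_n:
        return None  # conflicting heuristics: leave for manual review
    return top


def coverage(matrix: List[List[Label]]) -> List[float]:
    """Fraction of records on which each heuristic cast a vote."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [
        sum(row[j] is not None for row in matrix) / n_rows
        for j in range(n_cols)
    ]


if __name__ == "__main__":
    texts = [
        "I want a refund now",
        "How do I reset my password?",
        "Thanks, that fixed it",
        "Can I get a refund?",
    ]
    matrix = build_label_matrix(texts, HEURISTICS)
    for text, row in zip(texts, matrix):
        print(f"{text!r}: votes={row} -> {majority_vote(row)}")
    print("coverage per heuristic:", coverage(matrix))
```

Records where the heuristics tie or all abstain come back as None, which makes them natural candidates for manual labeling; per-heuristic coverage is one simple signal for judging how useful a heuristic is.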
Quick Start & Requirements
Install: pip install kern-refinery
Start: refinery start
Access the UI at: http://localhost:4455
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The open-source version is primarily single-user; multi-user capabilities and enterprise features are part of the commercial offerings. Custom Python libraries cannot be added directly to the execution environment for labeling functions; to have a library included, open an issue requesting it.