refinery by code-kern-ai

Open-source tool for NLP data scaling, assessment, and maintenance

Created 3 years ago

1,470 stars

Top 27.5% on SourcePulse

5 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Cofounder of Langflow

and 1 more!

Project Summary

Refinery is an open-source platform designed for data scientists to scale, assess, and maintain natural language processing (NLP) training data. It addresses the challenges of managing unstructured text data, enabling a data-centric approach to building better NLP models by semi-automating labeling, identifying low-quality data subsets, and monitoring data quality.

How It Works

Refinery employs a microservices architecture, integrating with libraries like Hugging Face Transformers and spaCy for NLP tasks and Qdrant for neural search. It supports a data-centric workflow by allowing users to define heuristics (e.g., Python functions, active learning models, zero-shot classifiers) to generate noisy labels. These heuristics, combined with manually labeled data, form a noisy label matrix used for analysis, quality assessment, and iterative model improvement.
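
The noisy label matrix idea can be illustrated with a minimal, library-free sketch. This is not refinery's API: the heuristic functions, the toy sentiment labels, and the majority-vote aggregation are all assumptions chosen for illustration. Each heuristic either votes for a label or abstains, and the votes for every record form one row of the matrix.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a heuristic returns None when it has no opinion


# Hypothetical heuristics for a toy sentiment task (illustrative only).
def lf_contains_great(text: str) -> Optional[str]:
    return "positive" if "great" in text.lower() else ABSTAIN


def lf_contains_refund(text: str) -> Optional[str]:
    return "negative" if "refund" in text.lower() else ABSTAIN


def lf_exclamation(text: str) -> Optional[str]:
    return "positive" if text.strip().endswith("!") else ABSTAIN


HEURISTICS: List[Callable[[str], Optional[str]]] = [
    lf_contains_great,
    lf_contains_refund,
    lf_exclamation,
]


def build_label_matrix(records, heuristics):
    """Return one row per record and one column per heuristic."""
    return [[h(text) for h in heuristics] for text in records]


def majority_vote(row):
    """Aggregate one row; abstain when there are no votes or the top two tie."""
    votes = Counter(label for label in row if label is not ABSTAIN)
    if not votes:
        return ABSTAIN
    (top, top_n), *rest = votes.most_common(2)
    if rest and rest[0][1] == top_n:
        return ABSTAIN
    return top


def coverage(matrix):
    """Fraction of records that received at least one non-abstain vote."""
    covered = sum(1 for row in matrix if any(v is not ABSTAIN for v in row))
    return covered / len(matrix) if matrix else 0.0


records = [
    "Great product, works well",
    "I want a refund",
    "Arrived on time",
    "Great, now I need a refund!",
]
matrix = build_label_matrix(records, HEURISTICS)
weak_labels = [majority_vote(row) for row in matrix]
```

Rows where heuristics disagree, or where every heuristic abstains, are natural candidates for manual review. Majority voting is only the simplest possible aggregation and is used here as a stand-in.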

Quick Start & Requirements

Install via pip: pip install kern-refinery
Run locally: refinery start
Prerequisites: Docker, Python. The system automatically clones necessary repositories.
Access: http://localhost:4455
Documentation: https://docs.refinery.bio/
Demo: https://refinery.bio/
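
After starting the services, it can be handy to confirm that something is listening on the local port before opening the browser. The helper below is a generic TCP check written for illustration; the function name is hypothetical, it is not part of refinery, and only the host and port come from the access URL above.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Port 4455 is taken from the access URL in the quick start.
    print("UI reachable:", is_reachable("localhost", 4455))
```

A successful connection only shows that the port is open; it does not confirm that every service has finished starting.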

Highlighted Details

Semi-automated labeling workflow with manual and programmatic options.
Neural search for retrieving similar records and identifying outliers.
Data management features include filtering, sorting, searching, and project metrics visualization.
Python SDK for programmatic data import/export and integration with tools like Rasa.
Integrates with Hugging Face for embeddings and spaCy for tokenization.
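
To make the neural search item concrete, here is a brute-force sketch of similarity lookup and outlier scoring over record embeddings. It is an assumption-laden illustration: refinery is described as using Qdrant for this, whereas the code below uses plain cosine similarity in pure Python, and the record ids, vectors, and function names are invented for the example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query_id: str, embeddings: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k records closest to query_id, best match first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def outlier_scores(embeddings: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each record as 1 minus similarity to its nearest neighbour."""
    return {
        rid: 1.0 - most_similar(rid, embeddings, k=1)[0][1]
        for rid in embeddings
    }


# Toy two-dimensional "embeddings"; real ones would come from a language model.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.8, 0.2],
    "d": [0.0, 1.0],
}
scores = outlier_scores(toy_embeddings)
```

Records with a high score have no close neighbour and are worth inspecting as possible outliers. A vector database replaces the linear scan with an index, but the ranking idea is the same.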

Maintenance & Community

Developed by Kern AI (see Health Check below for recent commit activity).
Community support via Discord.
Newsletter and social media presence on Twitter and LinkedIn.
Contributions are welcomed via feedback and code.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0.
Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The open-source version is primarily single-user; multi-user capabilities and enterprise features are part of commercial offerings. Custom Python libraries cannot be added to the labeling function execution environment directly; to have one included, open an issue requesting it.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

1 star in the last 30 days