refinery by code-kern-ai

Open-source tool for NLP data scaling, assessment, and maintenance

Created 3 years ago
1,459 stars

Top 28.1% on SourcePulse

Project Summary

Refinery is an open-source platform designed for data scientists to scale, assess, and maintain natural language processing (NLP) training data. It addresses the challenges of managing unstructured text data, enabling a data-centric approach to building better NLP models by semi-automating labeling, identifying low-quality data subsets, and monitoring data quality.

How It Works

Refinery employs a microservices architecture, integrating with libraries like Hugging Face Transformers and spaCy for NLP tasks and Qdrant for neural search. It supports a data-centric workflow by allowing users to define heuristics (e.g., Python functions, active learning models, zero-shot classifiers) to generate noisy labels. These heuristics, combined with manually labeled data, form a noisy label matrix used for analysis, quality assessment, and iterative model improvement.
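The noisy label matrix described above can be illustrated with a minimal sketch. Everything here is hypothetical and is not refinery's API: the heuristic functions, the label names, and the sample records are invented for illustration, and plain majority voting stands in for whatever aggregation refinery actually applies.

```python
from collections import Counter

ABSTAIN = None  # a heuristic returns this when it has no opinion on a record


# Hypothetical heuristics: each maps a text to a label or abstains.
def lf_contains_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN


HEURISTICS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]


def build_label_matrix(records, heuristics):
    """One row per record, one column per heuristic."""
    return [[heuristic(text) for heuristic in heuristics] for text in records]


def majority_vote(row):
    """Collapse one row of noisy votes into a single label (or abstain)."""
    votes = Counter(vote for vote in row if vote is not ABSTAIN)
    if not votes:
        return ABSTAIN
    (top_label, top_count), *rest = votes.most_common(2)
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # a tie between labels is treated as no decision
    return top_label


records = [
    "I want a refund now!!",
    "Thank you for the quick help",
    "Where is my order?",
]
matrix = build_label_matrix(records, HEURISTICS)
labels = [majority_vote(row) for row in matrix]
```

Records on which every heuristic abstains, or on which the heuristics disagree, are natural candidates for manual labeling or for writing a new heuristic.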

Quick Start & Requirements

  • Install via pip: pip install kern-refinery
  • Run locally: refinery start
  • Prerequisites: Docker and Python. The tool automatically clones the repositories it needs.
  • Access: http://localhost:4455
  • Documentation: https://docs.refinery.bio/
  • Demo: https://refinery.bio/

Highlighted Details

  • Semi-automated labeling workflow with manual and programmatic options.
  • Neural search for retrieving similar records and identifying outliers.
  • Data management features include filtering, sorting, searching, and project metrics visualization.
  • Python SDK for programmatic data import/export and integration with tools like Rasa.
  • Integrates with Hugging Face for embeddings and spaCy for tokenization.
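To make the neural search item above concrete, here is a rough sketch of similarity lookup and outlier scoring over record embeddings. It is an assumption-laden illustration, not how refinery does it: refinery delegates search to Qdrant, whereas this sketch uses brute-force cosine similarity in pure Python, and the record IDs, vectors, and function names are all made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=2):
    """Return the top_k (record_id, similarity) pairs closest to query_id."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def outlier_scores(embeddings):
    """Score each record by its distance to its nearest neighbour."""
    return {
        record_id: 1.0 - most_similar(record_id, embeddings, top_k=1)[0][1]
        for record_id in embeddings
    }


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {
    "r1": [1.0, 0.0, 0.0],
    "r2": [0.9, 0.1, 0.0],
    "r3": [0.8, 0.2, 0.1],
    "r4": [0.0, 0.0, 1.0],
}
neighbours = most_similar("r1", embeddings)
scores = outlier_scores(embeddings)
likely_outlier = max(scores, key=scores.get)
```

A record whose nearest neighbour is still far away (here `r4`) is a candidate outlier. A vector database replaces the brute-force scan with an approximate index so the lookup stays fast on large datasets.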

Maintenance & Community

  • Active development by Kern AI.
  • Community support via Discord.
  • Newsletter and social media presence on Twitter and LinkedIn.
  • Contributions are welcomed via feedback and code.

Licensing & Compatibility

  • Licensed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The open-source version is primarily single-user; multi-user capabilities and enterprise features are part of the commercial offerings. Custom Python libraries cannot be added directly to the labeling function execution environment; to have one included, open an issue.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
