refinery by code-kern-ai

Open-source tool for NLP data scaling, assessment, and maintenance

created 3 years ago
1,452 stars

Top 28.8% on sourcepulse

Project Summary

Refinery is an open-source platform designed for data scientists to scale, assess, and maintain natural language processing (NLP) training data. It addresses the challenges of managing unstructured text data, enabling a data-centric approach to building better NLP models by semi-automating labeling, identifying low-quality data subsets, and monitoring data quality.

How It Works

Refinery employs a microservices architecture, integrating with libraries like Hugging Face Transformers and spaCy for NLP tasks and Qdrant for neural search. It supports a data-centric workflow by allowing users to define heuristics (e.g., Python functions, active learning models, zero-shot classifiers) to generate noisy labels. These heuristics, combined with manually labeled data, form a noisy label matrix used for analysis, quality assessment, and iterative model improvement.
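As a rough illustration of the noisy label matrix idea, the sketch below applies a few heuristics to a handful of texts and aggregates their votes. It is not Refinery's API: the heuristic names, the sample texts, the labels, and the majority-vote aggregation are all assumptions chosen for the example, and manually labeled data is left out for brevity.

```python
from collections import Counter

ABSTAIN = None  # a heuristic returns this when it has no opinion

# Hypothetical heuristics: each maps a text to a label or abstains.
def lf_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

def lf_exclaim(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN

HEURISTICS = [lf_refund, lf_thanks, lf_exclaim]

def build_label_matrix(texts, heuristics):
    """One row per record, one column per heuristic."""
    return [[h(t) for h in heuristics] for t in texts]

def majority_vote(row):
    """Most common non-abstain vote, or ABSTAIN if there is none or a tie."""
    votes = Counter(v for v in row if v is not ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics: leave for manual review
    return ranked[0][0]

texts = [
    "I want a refund now!!",
    "Thank you for the quick help",
    "Where is my order?",
    "Thanks, but I still need a refund",
]
matrix = build_label_matrix(texts, HEURISTICS)
weak_labels = [majority_vote(row) for row in matrix]
```

Rows where every heuristic abstains, or where heuristics disagree, are the ones a data-centric workflow would typically surface for manual labeling or for refining the heuristics.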

Quick Start & Requirements

  • Install via pip: pip install kern-refinery
  • Run locally: refinery start
  • Prerequisites: Docker and Python. The required repositories are cloned automatically.
  • Access: http://localhost:4455
  • Documentation: https://docs.refinery.bio/
  • Demo: https://refinery.bio/
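
After starting the application, the containers may take a while to come up. A minimal sketch for waiting until the local UI responds is shown here; it uses only the Python standard library, and the function name, retry count, and delay are assumptions rather than anything Refinery provides.

```python
import time
import urllib.error
import urllib.request

def wait_for_ui(url="http://localhost:4455", attempts=30, delay=2.0):
    """Poll url until it answers over HTTP; return True on success."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet; wait before the next attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Calling `wait_for_ui()` with its defaults polls the access URL listed above for up to about a minute and returns False if nothing answers in that time.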

Highlighted Details

  • Semi-automated labeling workflow with manual and programmatic options.
  • Neural search for retrieving similar records and identifying outliers.
  • Data management features include filtering, sorting, searching, and project metrics visualization.
  • Python SDK for programmatic data import/export and integration with tools like Rasa.
  • Integrates with Hugging Face for embeddings and spaCy for tokenization.
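
To make the neural search item more concrete, the following sketch ranks records by cosine similarity of their embeddings and flags the record that is least similar to the rest. It is a toy, in-memory stand-in: the three-dimensional vectors and record ids are invented, and it does not call Hugging Face, Qdrant, or any Refinery interface.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_id, embeddings, k=2):
    """Ids of the k records closest to query_id by cosine similarity."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), rid)
        for rid, vec in embeddings.items()
        if rid != query_id
    ]
    scored.sort(reverse=True)
    return [rid for _, rid in scored[:k]]

def outlier(embeddings):
    """Id whose average similarity to all other records is lowest."""
    def mean_sim(rid):
        others = [v for r, v in embeddings.items() if r != rid]
        return sum(cosine(embeddings[rid], v) for v in others) / len(others)
    return min(embeddings, key=mean_sim)

# Invented toy embeddings; real ones would come from an embedding model.
embeddings = {
    "r1": [0.9, 0.1, 0.0],
    "r2": [0.8, 0.2, 0.1],
    "r3": [0.85, 0.15, 0.05],
    "r4": [0.0, 0.1, 0.95],
}
neighbours = most_similar("r1", embeddings)
odd_one_out = outlier(embeddings)
```

A dedicated vector database replaces the brute-force loop here with an index, which matters once the dataset grows beyond what a full pairwise comparison can handle.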

Maintenance & Community

  • Active development by Kern AI.
  • Community support via Discord.
  • Newsletter and social media presence on Twitter and LinkedIn.
  • Contributions are welcomed via feedback and code.

Licensing & Compatibility

  • Licensed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The open-source version is primarily single-user; multi-user capabilities and enterprise features are part of commercial offerings. Integrating custom Python libraries into the labeling function execution environment requires opening an issue for inclusion.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 17 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

autolabel by refuel-ai

0.3%
2k
Python library to label text datasets using LLMs
created 2 years ago
updated 5 months ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

0.4%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 5 days ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 5 more.

pattern by clips

0.0%
9k
Python web mining module
created 14 years ago
updated 1 year ago