refinery by code-kern-ai

Open-source tool for NLP data scaling, assessment, and maintenance

Created 3 years ago
1,455 stars

Top 28.1% on SourcePulse

Project Summary

Refinery is an open-source platform designed for data scientists to scale, assess, and maintain natural language processing (NLP) training data. It addresses the challenges of managing unstructured text data, enabling a data-centric approach to building better NLP models by semi-automating labeling, identifying low-quality data subsets, and monitoring data quality.

How It Works

Refinery employs a microservices architecture, integrating with libraries like Hugging Face Transformers and spaCy for NLP tasks and Qdrant for neural search. It supports a data-centric workflow by allowing users to define heuristics (e.g., Python functions, active learning models, zero-shot classifiers) to generate noisy labels. These heuristics, combined with manually labeled data, form a noisy label matrix used for analysis, quality assessment, and iterative model improvement.
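As a rough illustration of the noisy label matrix described above, the following sketch applies a few keyword heuristics to sample records and combines their votes by simple majority. The heuristic functions, label names, sample records, and the majority-vote aggregation are all assumptions made for illustration; this is not refinery's API, and the source does not say how refinery aggregates heuristic outputs.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a heuristic returns None when it has no opinion on a record

# Hypothetical heuristics: each maps a text to a label or abstains.
def contains_refund(text: str) -> Optional[str]:
    return "complaint" if "refund" in text.lower() else ABSTAIN

def contains_thanks(text: str) -> Optional[str]:
    return "praise" if "thank" in text.lower() else ABSTAIN

def has_exclamations(text: str) -> Optional[str]:
    return "complaint" if text.count("!") >= 2 else ABSTAIN

HEURISTICS: List[Callable[[str], Optional[str]]] = [
    contains_refund,
    contains_thanks,
    has_exclamations,
]

def build_label_matrix(records: List[str]) -> List[List[Optional[str]]]:
    """One row per record, one column per heuristic."""
    return [[heuristic(text) for heuristic in HEURISTICS] for text in records]

def majority_vote(row: List[Optional[str]]) -> Optional[str]:
    """Pick the most frequent non-abstaining label; abstain on ties or no votes."""
    votes = Counter(label for label in row if label is not ABSTAIN)
    if not votes:
        return ABSTAIN
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN
    return top_label

records = [
    "I want a refund now!!",
    "Thank you for the quick help",
    "Where is my order?",
]
matrix = build_label_matrix(records)
labels = [majority_vote(row) for row in matrix]

for text, row, label in zip(records, matrix, labels):
    print(f"{text!r}: votes={row} -> {label}")
```

Rows where every heuristic abstains, or where the votes tie, stay unlabeled. Those records are the natural candidates for manual labeling or for writing an additional heuristic.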

Quick Start & Requirements

  • Install via pip: pip install kern-refinery
  • Run locally: refinery start
  • Prerequisites: Docker, Python. The system automatically clones necessary repositories.
  • Access: http://localhost:4455
  • Documentation: https://docs.refinery.bio/
  • Demo: https://refinery.bio/

Highlighted Details

  • Semi-automated labeling workflow with manual and programmatic options.
  • Neural search for retrieving similar records and identifying outliers.
  • Data management features include filtering, sorting, searching, and project metrics visualization.
  • Python SDK for programmatic data import/export and integration with tools like Rasa.
  • Integrates with Hugging Face for embeddings and spaCy for tokenization.
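To make the neural search item above more concrete, here is a minimal sketch of similarity lookup and outlier detection over record embeddings. It uses brute-force cosine similarity on hand-written toy vectors. The record IDs, vectors, and function names are hypothetical, and in refinery this role is filled by Qdrant with embeddings from Hugging Face models, so treat this as an explanation of the idea rather than the project's implementation.

```python
import math
from typing import Dict, List, Tuple

# Toy embeddings (assumed values); real ones would come from an embedding model.
embeddings: Dict[str, List[float]] = {
    "rec-1": [0.9, 0.1, 0.0],
    "rec-2": [0.8, 0.2, 0.1],
    "rec-3": [0.85, 0.15, 0.05],
    "rec-4": [0.0, 0.1, 0.95],
}

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id: str, k: int = 2) -> List[Tuple[str, float]]:
    """Return the k records closest to the query record, best first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def find_outlier() -> str:
    """Return the record with the lowest mean similarity to all others."""
    def mean_similarity(record_id: str) -> float:
        others = [v for rid, v in embeddings.items() if rid != record_id]
        return sum(cosine(embeddings[record_id], v) for v in others) / len(others)
    return min(embeddings, key=mean_similarity)

print(most_similar("rec-1"))
print(find_outlier())
```

Brute-force comparison is quadratic in the number of records, which is why a dedicated vector database is used once datasets grow beyond toy size.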

Maintenance & Community

  • Developed by Kern AI; the last commit was 9 months ago.
  • Community support via Discord.
  • Newsletter and social media presence on Twitter and LinkedIn.
  • Contributions are welcomed via feedback and code.

Licensing & Compatibility

  • Licensed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The open-source version is primarily single-user; multi-user capabilities and enterprise features are part of the commercial offerings. Custom Python libraries cannot be added directly to the labeling function execution environment; to have one included, open an issue requesting it.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 5 stars in the last 30 days

Explore Similar Projects

argilla by argilla-io (0.2%, 5k)

  • Collaboration tool for building high-quality AI datasets
  • Created 4 years ago, updated 3 days ago
  • Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

spaCy by explosion (0.1%, 32k)

  • NLP library for production applications
  • Created 11 years ago, updated 3 months ago
  • Starred by Aravind Srinivas (Cofounder of Perplexity), François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize), and 42 more.