autolabel by refuel-ai

Python library to label text datasets using LLMs

created 2 years ago
2,250 stars

Top 20.6% on sourcepulse

Project Summary

Autolabel is a Python library for labeling, cleaning, and enriching text datasets with Large Language Models (LLMs). It targets ML engineers and researchers who want to reduce the cost and time of manual labeling, offering high-accuracy automated labeling with a configurable choice of LLM.

How It Works

Autolabel streamlines data labeling through a configuration-driven, three-step process: defining labeling guidelines and LLM parameters in JSON, dry-running to validate prompts, and executing the labeling job. It supports various LLM providers (OpenAI, Anthropic, HuggingFace, Google) and advanced techniques like few-shot learning and chain-of-thought prompting to enhance label quality.
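The three steps above can be sketched roughly as follows. The config keys and the `LabelingAgent` / `AutolabelDataset` calls follow the project's README as best recalled and should be treated as assumptions to verify against the current documentation. `render_prompt` is a hypothetical helper that only imitates the kind of prompt a dry run lets you inspect, and the example reviews are placeholders.

```python
import json

# Step 1: labeling guidelines and LLM parameters as a JSON-serializable config.
# Key names are assumed from the project README; verify against current docs.
config = {
    "task_name": "MovieSentimentReview",
    "task_type": "classification",
    "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
    "dataset": {"label_column": "label", "delimiter": ","},
    "prompt": {
        "task_guidelines": (
            "Classify the movie review into one of the following labels: {labels}"
        ),
        "labels": ["positive", "negative", "neutral"],
        "few_shot_examples": [
            {"example": "I got a fairly uninspired stupid film.", "label": "negative"},
            {"example": "A heartwarming story with great acting.", "label": "positive"},
        ],
        "example_template": "Input: {example}\nOutput: {label}",
    },
}


def render_prompt(cfg: dict, text: str) -> str:
    """Hypothetical helper: build a few-shot prompt locally, without calling an LLM."""
    p = cfg["prompt"]
    guidelines = p["task_guidelines"].format(labels=", ".join(p["labels"]))
    shots = [p["example_template"].format(**shot) for shot in p["few_shot_examples"]]
    # Leave the label empty so the model is asked to complete it.
    query = p["example_template"].format(example=text, label="").rstrip()
    return "\n\n".join([guidelines, *shots, query])


def run_labeling(cfg: dict, csv_path: str):
    """Steps 2 and 3 with the library itself (needs `autolabel` and a provider API key)."""
    from autolabel import AutolabelDataset, LabelingAgent  # assumed import path

    agent = LabelingAgent(config=cfg)
    ds = AutolabelDataset(csv_path, config=cfg)
    agent.plan(ds)        # dry run: preview prompts before spending on API calls
    return agent.run(ds)  # execute the labeling job


config_json = json.dumps(config, indent=2)  # what would be saved as a JSON config file
preview = render_prompt(config, "The plot dragged, but the ending was superb.")
```

Printing `preview` shows the guidelines, the few-shot examples, and the unlabeled input in one prompt, which is the kind of check the dry-run step is meant to support. `run_labeling` is not called here because it requires the library and credentials.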

Highlighted Details

  • Supports classification, question-answering, and named entity recognition tasks.
  • Integrates commercial and open-source LLMs.
  • Offers confidence estimation and explanations for labels.
  • Includes caching and state management for cost and time efficiency.
  • Provides access to Refuel-hosted open-source LLMs for calibration.
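To illustrate why the caching mentioned above saves cost and time, here is a minimal sketch that memoizes LLM calls by a hash of the model name and prompt. This is not Autolabel's implementation: `CachedLabeler` and `fake_llm` are hypothetical names, and a real cache would persist to disk or a database rather than live in memory.

```python
import hashlib
from typing import Callable, Dict


class CachedLabeler:
    """Hypothetical wrapper: skips the LLM call when the same prompt was seen before."""

    def __init__(self, model_name: str, llm: Callable[[str], str]) -> None:
        self.model_name = model_name
        self.llm = llm
        self.cache: Dict[str, str] = {}
        self.llm_calls = 0  # counts only real (uncached) calls

    def _key(self, prompt: str) -> str:
        # Key on model + prompt so switching models does not reuse stale labels.
        return hashlib.sha256(f"{self.model_name}\x00{prompt}".encode()).hexdigest()

    def label(self, prompt: str) -> str:
        key = self._key(prompt)
        if key not in self.cache:
            self.llm_calls += 1
            self.cache[key] = self.llm(prompt)
        return self.cache[key]


def fake_llm(prompt: str) -> str:
    """Stand-in for a paid API call; returns a canned label."""
    return "positive" if "great" in prompt else "negative"


labeler = CachedLabeler("demo-model", fake_llm)
labels = [labeler.label(p) for p in ["great film", "dull film", "great film"]]
```

The third prompt repeats the first, so only two calls reach the stand-in model. Rerunning an interrupted job benefits from a cache in the same way.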

Maintenance & Community

  • Active development with a public roadmap.
  • Community engagement via Discord and GitHub issues.
  • Links: Discord, Twitter, Website.

Licensing & Compatibility

  • License: Not explicitly stated in the README.

Limitations & Caveats

The README does not specify the project's license, which may impact commercial use or integration with closed-source projects. Benchmarking details are provided, but specific performance metrics against manual labeling or other tools are not directly summarized.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

54 stars in the last 90 days
