small-text by webis-de

Active Learning for efficient text classification data labeling

Created 4 years ago
638 stars

Top 52.1% on SourcePulse

Project Summary

Small-Text addresses the challenge of efficiently labeling training data for text classification, particularly when labeled data is scarce. It offers state-of-the-art active learning strategies, allowing users to easily combine pre-implemented query strategies, initialization methods, and stopping criteria with classifiers from scikit-learn, PyTorch, or Hugging Face Transformers. This accelerates the development of supervised text classification models by intelligently selecting the most informative data points for manual annotation, benefiting researchers and practitioners alike.

How It Works

The library provides a unified interface for active learning workflows. Users choose from scientifically evaluated query strategies, initialization strategies, and stopping criteria, and pair them with a classifier from scikit-learn, PyTorch, or Hugging Face Transformers. GPU acceleration is available through PyTorch, and the Transformers integration makes pretrained language models usable as text classifiers. Because the components are modular, they can be swapped independently, which simplifies both experimentation and application building while reducing manual annotation effort.
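To make the idea of a query strategy concrete, here is a minimal sketch of uncertainty sampling by prediction entropy, one common way to pick "informative" examples. It is written in plain Python and does not use small-text's API; the function names and the probability values are hypothetical and chosen only for illustration.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def query_most_uncertain(pool_probs, n):
    """Return indices of the n pool items with the highest prediction entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: prediction_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical class probabilities for four unlabeled texts (binary task).
pool_probs = [
    [0.98, 0.02],  # confident prediction, low entropy
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],
    [0.50, 0.50],  # maximally uncertain
]

# The two texts an annotator would be asked to label next.
selected = query_most_uncertain(pool_probs, n=2)
print(selected)  # [3, 1]
```

In a real workflow the probabilities would come from the current classifier's predictions on the unlabeled pool, and the selected items would be sent to a human annotator before retraining.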

Quick Start & Requirements

Highlighted Details

  • Unified interfaces for mixing and matching query strategies, classifiers (sklearn, PyTorch, Transformers), initialization strategies, and stopping criteria.
  • GPU support via PyTorch; CPU-only use has minimal dependencies.
  • Version 2.0.0 (alpha) introduces refined interfaces, new query strategies, improved classifiers, and vector indices.
  • Awarded EACL Best System Demonstration for its introductory paper.
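The mix-and-match design described in the first bullet can be illustrated with a small, self-contained sketch of a pool-based active learning loop in which the classifier, the query strategy, and the stopping criterion are interchangeable. This is not small-text's interface: every class and function name below is hypothetical, and the toy one-dimensional classifier merely stands in for a real text classifier.

```python
import math
import random


class ThresholdClassifier:
    """Toy 1-D classifier (stand-in for a text classifier): learns a cut point."""

    def __init__(self):
        self.cut = 0.0

    def fit(self, xs, ys):
        zeros = [x for x, y in zip(xs, ys) if y == 0]
        ones = [x for x, y in zip(xs, ys) if y == 1]
        # Assumes both classes are present in the labeled data.
        self.cut = (max(zeros) + min(ones)) / 2

    def predict_proba(self, x):
        p1 = 1 / (1 + math.exp(-10 * (x - self.cut)))
        return [1 - p1, p1]


def least_confidence(clf, pool, n):
    """Query strategy: the n items whose top class probability is lowest."""
    return sorted(pool, key=lambda x: max(clf.predict_proba(x)))[:n]


def random_sampling(seed):
    """Query strategy factory: uniform random selection (a common baseline)."""
    rng = random.Random(seed)
    return lambda clf, pool, n: rng.sample(pool, min(n, len(pool)))


def budget_exhausted(max_labels):
    """Stopping criterion factory: stop once max_labels labels are collected."""
    return lambda labeled: len(labeled) >= max_labels


def active_learning_loop(clf, pool, oracle, seed, query, stop, batch_size=1):
    """Train, query, label, repeat until the stopping criterion fires."""
    labeled = {x: oracle(x) for x in seed}  # initialization step
    pool = [x for x in pool if x not in labeled]
    while pool and not stop(labeled):
        clf.fit(list(labeled), list(labeled.values()))
        for x in query(clf, pool, batch_size):
            labeled[x] = oracle(x)  # the oracle plays the human annotator
            pool.remove(x)
    clf.fit(list(labeled), list(labeled.values()))
    return clf, labeled


pool = [i / 20 for i in range(21)]
oracle = lambda x: int(x >= 0.62)  # hidden ground truth, for the demo only

clf, labeled = active_learning_loop(
    ThresholdClassifier(), pool, oracle, seed=[0.0, 1.0],
    query=least_confidence, stop=budget_exhausted(6),
)

# Swapping in another strategy needs no change to the loop itself.
clf_rand, labeled_rand = active_learning_loop(
    ThresholdClassifier(), pool, oracle, seed=[0.0, 1.0],
    query=random_sampling(0), stop=budget_exhausted(6),
)
```

With only six labels, the least-confidence run places its cut point close to the hidden boundary at 0.62, because each query bisects the remaining uncertain region. The random baseline uses the same loop and budget but offers no such guarantee.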

Maintenance & Community

Developed by Christopher Schröder at Leipzig University's NLP group (Webis). The project is funded by the Development Bank of Saxony. Contributions are welcomed. A community survey on active learning in NLP was conducted in March 2026.

Licensing & Compatibility

Licensed under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Version 2.0.0.dev3 is an alpha release and may not have stable interfaces. The project emphasizes its progress and feature set, noting that simple counts do not fully represent its capabilities.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
