BlingFire by microsoft

Fast text tokenization library

Created 6 years ago
1,877 stars

Top 22.9% on SourcePulse

Project Summary

BlingFire is a high-performance library for finite state machine and regular expression manipulation, primarily focused on Natural Language Processing (NLP) tasks. It offers state-of-the-art tokenization algorithms, including WordPiece, SentencePiece (Unigram LM and BPE), and pattern-based methods, designed for speed and ease of use across multiple programming languages.

How It Works

BlingFire is built on highly optimized C++ implementations of finite state machines and tokenization algorithms. It exposes a uniform interface to several tokenization models (e.g., BERT, XLNET, RoBERTa), which are loaded from external model files. The default tokenization API is designed to need minimal or no configuration or initialization; normalization is optional, and users can build custom models.
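
To make the WordPiece algorithm named above concrete, here is a minimal pure-Python sketch of the standard BERT-style greedy longest-match-first split. It is not BlingFire's implementation, which compiles the vocabulary into finite state machines in C++; the function name, the toy vocabulary, and the "##" continuation prefix are assumptions chosen for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into WordPiece units by greedy longest-match-first.

    Illustrative sketch only: a real tokenizer would also normalize text,
    split on whitespace and punctuation, and cap the word length.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model file.
TOY_VOCAB = {"token", "##ization", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(wordpiece_tokenize("unbelievable", TOY_VOCAB))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", TOY_VOCAB))           # ['[UNK]']
```

The inner loop rescans substrings, so this sketch is quadratic in the word length. Avoiding that rescanning is the kind of cost a compiled automaton is meant to remove.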

Quick Start & Requirements

  • Install via pip: pip install -U blingfire
  • Requires Python.
  • Precompiled models for various tokenizers (BERT, XLNET, RoBERTa, etc.) are included or can be loaded from external files.
  • Official documentation and examples are available in the repository.
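
After installation, the default tokenizers can be tried without loading a model file. The sketch below uses the package's `text_to_sentences` and `text_to_words` functions; the wrapper `split_text` and the sample sentence are hypothetical names added here for illustration, and the snippet assumes `blingfire` is installed.

```python
def split_text(text):
    """Return (sentences, words) for `text` using BlingFire's default tokenizers.

    `split_text` is a hypothetical helper for this example, not part of the library.
    """
    # Imported lazily so this module can be loaded before blingfire is installed.
    from blingfire import text_to_sentences, text_to_words

    # text_to_sentences returns one string with sentences separated by newlines.
    sentences = text_to_sentences(text).split("\n")
    # text_to_words returns one string with tokens separated by spaces.
    words = text_to_words(text).split(" ")
    return sentences, words
```

Calling `split_text("I saw a girl with a telescope. She waved.")` returns a list of sentences and a list of word tokens. Model-specific tokenizers such as BERT need a model file to be loaded first; the repository documents that workflow.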

Highlighted Details

  • Claims to be 4-5x faster than Hugging Face Tokenizers and ~2x faster than SentencePiece implementations.
  • Offers a uniform interface for multiple tokenization algorithms, simplifying integration with models like BERT and XLNET.
  • Supports tokenization, sentence breaking, multi-word expression matching, stemming/lemmatization, and induced syllabification.
  • Provides C#, Rust, and JavaScript (WASM) APIs, enabling cross-platform and web-based applications.

Maintenance & Community

  • Developed by Microsoft's "Bling" team, which contributes to Bing's NLP capabilities.
  • Contributions are welcome, subject to a Contributor License Agreement (CLA).
  • Security issues should be reported to Microsoft Security Response Center (MSRC).

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The JavaScript integration is described as "still in progress" and may change. Although the library has bindings for several programming languages, the most extensive examples are Python-based.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
