BlingFire by microsoft

Fast text tokenization library

Created 6 years ago
1,857 stars

Top 23.4% on SourcePulse

Project Summary

BlingFire is a high-performance library for finite state machine and regular expression manipulation, primarily focused on Natural Language Processing (NLP) tasks. It offers state-of-the-art tokenization algorithms, including WordPiece, SentencePiece (Unigram LM and BPE), and pattern-based methods, designed for speed and ease of use across multiple programming languages.

How It Works

BlingFire is built on highly optimized C++ implementations of finite state machines and tokenization algorithms. It exposes a uniform interface to several tokenization models (e.g., BERT, XLNET, RoBERTa), each loaded from a model file. The default tokenizer is designed to work with minimal or no configuration and no training step; normalization is optional, and users can build their own models.
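To make the WordPiece idea concrete, here is a minimal pure-Python sketch of greedy longest-match-first sub-word splitting. It is an illustration only: the function name `wordpiece_tokenize`, the toy vocabulary, and the `##` continuation prefix are assumptions chosen for the example, and this is not how BlingFire implements the algorithm (the library compiles its models into finite state machines in C++).

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into sub-word pieces by greedy longest match.

    Hypothetical helper for illustration; not part of the blingfire API.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
TOY_VOCAB = {"token", "##ization", "##ize", "##r"}

print(wordpiece_tokenize("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(wordpiece_tokenize("tokenizer", TOY_VOCAB))     # ['token', '##ize', '##r']
print(wordpiece_tokenize("xyz", TOY_VOCAB))           # ['[UNK]']
```

The inner loop rescans the vocabulary for every shrinking candidate, which is the cost a precompiled automaton avoids by walking the input once.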

Quick Start & Requirements

  • Install via pip: pip install -U blingfire
  • Requires Python.
  • Precompiled models for various tokenizers (BERT, XLNET, RoBERTa, etc.) are included or can be loaded from external files.
  • Official documentation and examples are available in the repository.
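After installation, basic usage might look like the sketch below. It assumes the `blingfire` Python package exposes `text_to_words`, `text_to_sentences`, `load_model`, `text_to_ids`, and `free_model`, and that a `bert_base_tok.bin` model file ships inside the package directory; check the repository for the exact names and bundled models in your installed version. The wrapper `tokenize_demo` and the values `max_len=16` and `unk_id=100` are illustrative choices, not library defaults.

```python
import os


def tokenize_demo(text, max_len=16, unk_id=100):
    """Run the default tokenizer, the sentence breaker, and a BERT model.

    Hypothetical wrapper for illustration. blingfire is imported lazily so
    that this module can be loaded even where the package is not installed.
    """
    import blingfire

    # Default pattern-based tokenizer: returns one space-delimited string.
    words = blingfire.text_to_words(text)

    # Sentence breaker: returns one string with a sentence per line.
    sentences = blingfire.text_to_sentences(text)

    # Assumption: the BERT base model file is bundled next to the package.
    model_path = os.path.join(
        os.path.dirname(blingfire.__file__), "bert_base_tok.bin"
    )
    handle = blingfire.load_model(model_path)
    try:
        # Fixed-length array of token ids; unk_id stands in for unknown tokens.
        ids = blingfire.text_to_ids(handle, text, max_len, unk_id)
    finally:
        # Model handles are not garbage collected; release them explicitly.
        blingfire.free_model(handle)

    return words, sentences, ids
```

Calling `tokenize_demo("Hello world. How are you?")` returns the tokenized string, the sentence-split string, and an array of `max_len` ids. The model is loaded once and freed in a `finally` block so the handle is released even if tokenization raises.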

Highlighted Details

  • Claims to be 4-5x faster than Hugging Face Tokenizers and ~2x faster than SentencePiece implementations.
  • Offers a uniform interface for multiple tokenization algorithms, simplifying integration with models like BERT and XLNET.
  • Supports tokenization, sentence breaking, multi-word expression matching, stemming/lemmatization, and induced syllabification.
  • Provides C#, Rust, and JavaScript (WASM) APIs, enabling cross-platform and web-based applications.

Maintenance & Community

  • Developed by the "Bling" team at Microsoft, contributing to Bing's NLP capabilities.
  • Contributions are welcome, subject to a Contributor License Agreement (CLA).
  • Security issues should be reported to Microsoft Security Response Center (MSRC).

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The JavaScript integration is described as "still in progress" and may change. Although several languages are supported, Python is the primary focus and has the most extensive examples.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
