BlingFire by Microsoft

Fast text tokenization library

created 6 years ago
1,853 stars

Top 23.9% on sourcepulse

Project Summary

BlingFire is a high-performance library for finite state machine and regular expression manipulation, primarily focused on Natural Language Processing (NLP) tasks. It offers state-of-the-art tokenization algorithms, including WordPiece, SentencePiece (Unigram LM and BPE), and pattern-based methods, designed for speed and ease of use across multiple programming languages.

How It Works

BlingFire leverages highly optimized C++ implementations of finite state machines and tokenization algorithms. It provides a unified interface for various tokenization models (e.g., BERT, XLNET, RoBERTa), allowing users to load external model files. The library emphasizes minimal configuration and works out of the box with its bundled models, with optional input normalization and support for building custom models.
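
As a concrete illustration of that flow, the sketch below loads one of the pretrained tokenizer models shipped with the pip package and converts text to token ids. The model file name (bert_base_tok.bin), its location next to the installed package, and the argument values are assumptions to verify against the installed release.

    import os
    import blingfire

    # Locate a bundled tokenizer model next to the installed package
    # (file name assumed; the repository also lists models such as
    # roberta.bin and xlnet.bin).
    model_path = os.path.join(os.path.dirname(blingfire.__file__), "bert_base_tok.bin")

    # Load the model and encode text into token ids (WordPiece for BERT);
    # 128 is the maximum sequence length and 100 stands in for unknown tokens.
    handle = blingfire.load_model(model_path)
    ids = blingfire.text_to_ids(handle, "BlingFire makes tokenization fast.", 128, 100)
    print(ids)

    # Release the native handle when finished.
    blingfire.free_model(handle)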

Quick Start & Requirements

  • Install via pip: pip install -U blingfire (a minimal usage sketch follows this list)
  • The Python wrapper requires Python; the core library itself is written in C++, with bindings for other languages.
  • Precompiled models for various tokenizers (BERT, XLNET, RoBERTa, etc.) are included or can be loaded from external files.
  • Official documentation and examples are available in the repository.
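
A minimal sketch of the default, model-free entry points, assuming only the pip package is installed:

    import blingfire

    text = "BlingFire is fast. It also breaks text into sentences!"

    # Pattern-based word tokenization; tokens are returned as a single
    # space-separated string with punctuation split off.
    print(blingfire.text_to_words(text))

    # Sentence breaking; sentences are returned one per line.
    print(blingfire.text_to_sentences(text))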

Highlighted Details

  • Claims to be 4-5x faster than Hugging Face Tokenizers and ~2x faster than SentencePiece implementations.
  • Offers a uniform interface for multiple tokenization algorithms, simplifying integration with models like BERT and XLNET.
  • Supports tokenization, sentence breaking, multi-word expression matching, stemming/lemmatization, and induced syllabification (a sentence-breaking sketch follows this list).
  • Provides C#, Rust, and JavaScript (WASM) APIs, enabling cross-platform and web-based applications.
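
For instance, sentence breaking can also report where each sentence sits in the original string. The sketch below assumes the text_to_sentences_and_offsets helper exposed by the pip package and its (text, offsets) return shape; both should be checked against the installed version.

    import blingfire

    text = "The quick brown fox jumps over the lazy dog. Then it rests."

    # Assumed helper: returns the sentence-broken text together with a list
    # of (start, end) character offsets into the original string.
    _, offsets = blingfire.text_to_sentences_and_offsets(text)

    for start, end in offsets:
        print(text[start:end])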

Maintenance & Community

  • Developed by the "Bling" team at Microsoft, contributing to Bing's NLP capabilities.
  • Contributions are welcome, subject to a Contributor License Agreement (CLA).
  • Security issues should be reported to Microsoft Security Response Center (MSRC).

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The JavaScript integration is noted as "still in progress" and may change. While bindings exist for several languages, the primary focus and most extensive examples are Python-based.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

Explore Similar Projects

  • tokenmonster by alasdairforsythe: subword tokenizer and vocabulary trainer for multiple languages (594 stars; created 2 years ago, updated 1 year ago). Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

  • minbpe by karpathy: minimal BPE encoder/decoder for LLM tokenization (10k stars; created 1 year ago, updated 1 year ago).