BlingFire by Microsoft

Fast text tokenization library

created 6 years ago
1,853 stars

Top 23.9% on sourcepulse

Project Summary

BlingFire is a high-performance library for finite state machine and regular expression manipulation, primarily focused on Natural Language Processing (NLP) tasks. It offers state-of-the-art tokenization algorithms, including WordPiece, SentencePiece (Unigram LM and BPE), and pattern-based methods, designed for speed and ease of use across multiple programming languages.

How It Works

BlingFire leverages highly optimized C++ implementations of finite state machines and tokenization algorithms. It provides a unified interface for various tokenization models (e.g., BERT, XLNET, RoBERTa), allowing users to load external model files. The library emphasizes minimal configuration and works out of the box with its bundled models, with optional input normalization and support for building custom models.
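
As a concrete illustration of that flow, the sketch below loads one of the pretrained tokenizer models shipped with the pip package and converts text to token ids. The model file name (bert_base_tok.bin), its location next to the installed package, and the argument values are assumptions to verify against the installed release.

    import os
    import blingfire

    # Locate a bundled tokenizer model next to the installed package
    # (file name assumed; the repository also lists models such as
    # roberta.bin and xlnet.bin).
    model_path = os.path.join(os.path.dirname(blingfire.__file__), "bert_base_tok.bin")

    # Load the model and encode text into token ids (WordPiece for BERT);
    # 128 is the maximum sequence length and 100 stands in for unknown tokens.
    handle = blingfire.load_model(model_path)
    ids = blingfire.text_to_ids(handle, "BlingFire makes tokenization fast.", 128, 100)
    print(ids)

    # Release the native handle when finished.
    blingfire.free_model(handle)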

Quick Start & Requirements

  • Install via pip: pip install -U blingfire (a minimal usage sketch follows this list)
  • The Python wrapper requires Python; the core library itself is written in C++, with bindings for other languages.
  • Precompiled models for various tokenizers (BERT, XLNET, RoBERTa, etc.) are included or can be loaded from external files.
  • Official documentation and examples are available in the repository.
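
A minimal sketch of the default, model-free entry points, assuming only the pip package is installed:

    import blingfire

    text = "BlingFire is fast. It also breaks text into sentences!"

    # Pattern-based word tokenization; tokens are returned as a single
    # space-separated string with punctuation split off.
    print(blingfire.text_to_words(text))

    # Sentence breaking; sentences are returned one per line.
    print(blingfire.text_to_sentences(text))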

Highlighted Details

  • Claims to be 4-5x faster than Hugging Face Tokenizers and ~2x faster than SentencePiece implementations.
  • Offers a uniform interface for multiple tokenization algorithms, simplifying integration with models like BERT and XLNET.
  • Supports tokenization, sentence breaking, multi-word expression matching, stemming/lemmatization, and induced syllabification (a sentence-breaking sketch follows this list).
  • Provides C#, Rust, and JavaScript (WASM) APIs, enabling cross-platform and web-based applications.
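
For instance, sentence breaking can also report where each sentence sits in the original string. The sketch below assumes the text_to_sentences_and_offsets helper exposed by the pip package and its (text, offsets) return shape; both should be checked against the installed version.

    import blingfire

    text = "The quick brown fox jumps over the lazy dog. Then it rests."

    # Assumed helper: returns the sentence-broken text together with a list
    # of (start, end) character offsets into the original string.
    _, offsets = blingfire.text_to_sentences_and_offsets(text)

    for start, end in offsets:
        print(text[start:end])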

Maintenance & Community

  • Developed by the "Bling" team at Microsoft, contributing to Bing's NLP capabilities.
  • Contributions are welcome, subject to a Contributor License Agreement (CLA).
  • Security issues should be reported to Microsoft Security Response Center (MSRC).

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The JavaScript integration is noted as "still in progress" and may change. While bindings exist for several languages, the primary focus and most extensive examples are Python-based.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

Explore Similar Projects

  • tokenmonster by alasdairforsythe: subword tokenizer and vocabulary trainer for multiple languages (594 stars; created 2 years ago, updated 1 year ago). Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

  • minbpe by karpathy: minimal BPE encoder/decoder for LLM tokenization (10k stars; created 1 year ago, updated 1 year ago).