tokenizers-cpp by mlc-ai

Cross-platform C++ tokenizer binding library for universal deployment

Created 2 years ago
388 stars

Top 73.9% on SourcePulse

Project Summary

This C++ library provides a unified, cross-platform interface for HuggingFace tokenizers and SentencePiece, targeting developers building native language model applications. It simplifies tokenizer deployment across diverse platforms like iOS, Android, Windows, Linux, and web browsers by offering a minimal C++ API with reduced dependencies.

How It Works

The project wraps two existing tokenizer implementations, the Rust-based HuggingFace tokenizers library and the C++ SentencePiece library, and exposes both through a common C++ interface. The HuggingFace backend relies on Rust's performance and cross-compilation support for mobile targets, while web builds go through Emscripten. This hides the differences between the individual tokenizer libraries and their build processes, so a C++ project only has to integrate a single API.
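The "common interface over several backends" idea can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical and written for illustration only: the `Tokenizer` base class, the `WhitespaceTokenizer` toy backend, and the `MakeWhitespaceTokenizer` factory are assumptions, not the library's actual headers or API. Consult the repository's example folder for the real entry points.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical common interface: callers see only Encode/Decode,
// never the backend that implements them.
class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  virtual std::vector<int> Encode(const std::string& text) = 0;
  virtual std::string Decode(const std::vector<int>& ids) = 0;
};

// Toy backend standing in for a wrapped library (not a real tokenizer):
// splits on whitespace and assigns ids in order of first appearance.
class WhitespaceTokenizer : public Tokenizer {
 public:
  std::vector<int> Encode(const std::string& text) override {
    std::vector<int> ids;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
      auto it = word_to_id_.find(word);
      if (it == word_to_id_.end()) {
        // First time this word is seen: give it the next free id.
        int next_id = static_cast<int>(id_to_word_.size());
        it = word_to_id_.emplace(word, next_id).first;
        id_to_word_.push_back(word);
      }
      ids.push_back(it->second);
    }
    return ids;
  }

  std::string Decode(const std::vector<int>& ids) override {
    std::string text;
    for (int id : ids) {
      // Ids that were never assigned are skipped.
      if (id < 0 || id >= static_cast<int>(id_to_word_.size())) continue;
      if (!text.empty()) text += ' ';
      text += id_to_word_[id];
    }
    return text;
  }

 private:
  std::unordered_map<std::string, int> word_to_id_;
  std::vector<std::string> id_to_word_;
};

// Hypothetical factory: the caller receives the abstract type, so a
// different backend could be swapped in without changing calling code.
std::unique_ptr<Tokenizer> MakeWhitespaceTokenizer() {
  return std::make_unique<WhitespaceTokenizer>();
}
```

With this sketch, calling `MakeWhitespaceTokenizer()->Encode("hello world hello")` yields the ids 0, 1, 0, and decoding those ids returns the original string. The design point is that application code depends only on the abstract class, so choosing between backends becomes a matter of which factory is called.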

Quick Start & Requirements

  • Add as a Git submodule and include via add_subdirectory in CMake.
  • Requires C++17 support.
  • Rust must be installed for building; cross-compilation targets (e.g., rustup target add aarch64-apple-ios) may be needed.
  • See the example folder for a CMake project example.

Highlighted Details

  • Generates static libraries: libtokenizers_c.a (Rust binding), libsentencepiece.a (SentencePiece), and libtokenizers_cpp.a (C++ binding).
  • Supports JavaScript/Wasm export via Emscripten.
  • Used in MLC LLM for native LLM chat application integrations.

Maintenance & Community

  • Developed in part with and used in MLC LLM.
  • Links to the MLC LLM project for integration examples.

Licensing & Compatibility

  • Based on the SentencePiece and HuggingFace tokenizers libraries. The README does not detail the licenses of these underlying components, so suitability for commercial use or closed-source linking should be confirmed against the license files in the repository and its dependencies rather than assumed.

Limitations & Caveats

The README does not explicitly state the license of the project itself, only that it builds upon other libraries. It focuses on static library generation, and dynamic linking options are not mentioned.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days
