Cross-platform C++ tokenizer binding library for universal deployment
This C++ library provides a unified, cross-platform interface for HuggingFace tokenizers and SentencePiece, targeting developers building native language model applications. It simplifies tokenizer deployment across diverse platforms like iOS, Android, Windows, Linux, and web browsers by offering a minimal C++ API with reduced dependencies.
How It Works
The project wraps two existing tokenizer implementations, the Rust HuggingFace tokenizers library and the C++ SentencePiece library, and exposes both through a common C++ interface. It relies on Rust for performance and for cross-compilation to mobile targets, and on Emscripten for web targets. The goal is to hide the differences between the individual tokenizer libraries and their build processes, so that a C++ project integrates a single API.
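The wrapping pattern described above can be sketched as one abstract interface with interchangeable backends. This is a minimal illustration, not the library's actual API: the `sketch` namespace, the class names, and the factory functions are all hypothetical, and the two toy backends only stand in for the real HuggingFace and SentencePiece bindings.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

namespace sketch {  // hypothetical namespace, not the library's own

// Common interface that application code programs against.
class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  virtual std::vector<int32_t> Encode(const std::string& text) = 0;
  virtual std::string Decode(const std::vector<int32_t>& ids) = 0;
};

// Toy backend 1: splits on whitespace and assigns ids in order of first use.
// A real backend would forward these calls to a wrapped tokenizer library.
class WhitespaceBackend : public Tokenizer {
 public:
  std::vector<int32_t> Encode(const std::string& text) override {
    std::vector<int32_t> ids;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
      auto it = vocab_.find(word);
      if (it == vocab_.end()) {
        it = vocab_.emplace(word, static_cast<int32_t>(words_.size())).first;
        words_.push_back(word);
      }
      ids.push_back(it->second);
    }
    return ids;
  }

  std::string Decode(const std::vector<int32_t>& ids) override {
    std::string out;
    for (int32_t id : ids) {
      // Skip ids this instance has never produced.
      if (id < 0 || static_cast<std::size_t>(id) >= words_.size()) continue;
      if (!out.empty()) out += ' ';
      out += words_[static_cast<std::size_t>(id)];
    }
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> vocab_;
  std::vector<std::string> words_;
};

// Toy backend 2: one id per byte of input.
class ByteBackend : public Tokenizer {
 public:
  std::vector<int32_t> Encode(const std::string& text) override {
    std::vector<int32_t> ids;
    for (unsigned char c : text) ids.push_back(static_cast<int32_t>(c));
    return ids;
  }

  std::string Decode(const std::vector<int32_t>& ids) override {
    std::string out;
    for (int32_t id : ids) out += static_cast<char>(id);
    return out;
  }
};

// Factories hide which backend sits behind the interface.
std::unique_ptr<Tokenizer> MakeWhitespaceTokenizer() {
  return std::make_unique<WhitespaceBackend>();
}

std::unique_ptr<Tokenizer> MakeByteTokenizer() {
  return std::make_unique<ByteBackend>();
}

// Application code depends only on the abstract interface.
std::string RoundTrip(Tokenizer& tok, const std::string& text) {
  return tok.Decode(tok.Encode(text));
}

}  // namespace sketch
```

A caller would obtain a tokenizer from one of the factories and pass it to `RoundTrip` or any other function written against `sketch::Tokenizer`. Swapping the backend then changes only the factory call, which is the benefit the common interface is meant to provide.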
Quick Start & Requirements
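Integrate the library into a C++ project with add_subdirectory in CMake.
Installing a Rust cross-compilation target (e.g., rustup target add aarch64-apple-ios) may be needed.
See the example folder for a CMake project example.
Highlighted Details
The build produces three static libraries: libtokenizers_c.a (Rust binding), libsentencepiece.a (SentencePiece), and libtokenizers_cpp.a (C++ binding).
Maintenance & Community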
Licensing & Compatibility
Limitations & Caveats
The README does not explicitly state the license of the project itself, only that it builds upon other libraries. It focuses on static library generation, and dynamic linking options are not mentioned.