tokenizers-cpp  by mlc-ai

Cross-platform C++ tokenizer binding library for universal deployment

created 2 years ago
367 stars

Project Summary

This C++ library provides a unified, cross-platform interface for HuggingFace tokenizers and SentencePiece, targeting developers building native language model applications. It simplifies tokenizer deployment across diverse platforms like iOS, Android, Windows, Linux, and web browsers by offering a minimal C++ API with reduced dependencies.

How It Works

The project wraps two existing tokenizer engines, the Rust implementation of HuggingFace tokenizers and the C++ SentencePiece library, and exposes both through a common C++ interface. Rust's cross-compilation support covers the mobile targets, and the library can also be built for the web via Emscripten. The goal is to hide the differences between the individual tokenizer libraries and their build processes so that a C++ project integrates against a single API.
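The idea of one interface over several engines can be sketched as below. This is a minimal, self-contained illustration, not the library's actual API: the class names `Tokenizer` and `WhitespaceTokenizer`, the factory `MakeToyTokenizer`, and the whitespace-splitting backend are all hypothetical stand-ins for a real engine such as HuggingFace tokenizers or SentencePiece.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical common interface: callers only ever see Encode and Decode.
class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  virtual std::vector<int32_t> Encode(const std::string& text) = 0;
  virtual std::string Decode(const std::vector<int32_t>& ids) = 0;
};

// Toy backend (an assumption for illustration): splits on whitespace and
// assigns ids in order of first appearance. A real backend would forward
// these calls to the wrapped tokenizer engine instead.
class WhitespaceTokenizer : public Tokenizer {
 public:
  std::vector<int32_t> Encode(const std::string& text) override {
    std::vector<int32_t> ids;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
      auto it = word_to_id_.find(word);
      if (it == word_to_id_.end()) {
        // New word: its id is the next free slot in the vocabulary.
        const auto id = static_cast<int32_t>(id_to_word_.size());
        it = word_to_id_.emplace(word, id).first;
        id_to_word_.push_back(word);
      }
      ids.push_back(it->second);
    }
    return ids;
  }

  std::string Decode(const std::vector<int32_t>& ids) override {
    std::string text;
    for (int32_t id : ids) {
      // Ids must have been produced by an earlier Encode call.
      assert(id >= 0 && static_cast<std::size_t>(id) < id_to_word_.size());
      if (!text.empty()) text += ' ';
      text += id_to_word_[static_cast<std::size_t>(id)];
    }
    return text;
  }

 private:
  std::unordered_map<std::string, int32_t> word_to_id_;
  std::vector<std::string> id_to_word_;
};

// Factory that hides the concrete backend behind the interface type.
std::unique_ptr<Tokenizer> MakeToyTokenizer() {
  return std::make_unique<WhitespaceTokenizer>();
}
```

Application code holds only a `std::unique_ptr<Tokenizer>`, so swapping one backend for another changes the factory call and nothing else. That is the kind of decoupling the paragraph above describes.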

Quick Start & Requirements

  • Add as a Git submodule and include via add_subdirectory in CMake.
  • Requires C++17 support.
  • Rust must be installed to build the library; cross-compiling may require adding the target toolchain first (e.g., rustup target add aarch64-apple-ios).
  • See the example folder for a CMake project example.

Highlighted Details

  • Builds three static libraries: libtokenizers_c.a (C binding to the Rust tokenizers library), libsentencepiece.a (SentencePiece), and libtokenizers_cpp.a (C++ binding).
  • Supports JavaScript/Wasm export via Emscripten.
  • Used in MLC LLM for native LLM chat application integrations.

Maintenance & Community

  • Developed in part alongside MLC LLM, which also uses it.
  • The README links to the MLC LLM project for integration examples.

Licensing & Compatibility

  • Built on the SentencePiece and HuggingFace tokenizers libraries. The README does not detail the licenses of these underlying components, so check the license terms of the repository and of each upstream library before commercial use or closed-source linking.

Limitations & Caveats

The README does not explicitly state the license of the project itself, only that it builds upon other libraries. It focuses on static library generation, and dynamic linking options are not mentioned.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 3
  • Star history: 45 stars in the last 90 days
