llama-tokenizer-js by belladoreai

JS tokenizer for LLaMA 1 and 2 models, client-side in browser

created 2 years ago
355 stars

Top 79.7% on sourcepulse

Project Summary

This library provides a JavaScript tokenizer specifically for LLaMA 1 and LLaMA 2 models, enabling accurate client-side token counting in web applications. It offers a lightweight, zero-dependency solution for developers who need to manage token limits or analyze text length in browser or Node.js environments.

How It Works

The tokenizer implements a Byte-Pair Encoding (BPE) algorithm, optimized for both runtime performance and minimal bundle size. It bakes the necessary vocabulary and merge data into a single, base64-encoded JavaScript file, eliminating external dependencies and simplifying integration. This approach ensures efficient processing directly within the client's environment.
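
To illustrate the core idea, here is a minimal sketch of the greedy merge loop that drives BPE. This is not the library's actual implementation (which works on bytes and uses merge ranks baked into its single data file); the bpeEncode function and the merges map are hypothetical, for illustration only.

    // Minimal greedy BPE merge loop (illustrative sketch, not the library's code).
    // `merges` maps an adjacent pair "left right" to its rank; lower rank merges first.
    function bpeEncode(text, merges) {
      let tokens = Array.from(text); // the real tokenizer starts from bytes, not characters
      while (true) {
        // Find the adjacent pair with the lowest (earliest) merge rank.
        let bestRank = Infinity, bestIndex = -1;
        for (let i = 0; i < tokens.length - 1; i++) {
          const rank = merges.get(tokens[i] + " " + tokens[i + 1]);
          if (rank !== undefined && rank < bestRank) {
            bestRank = rank;
            bestIndex = i;
          }
        }
        if (bestIndex === -1) break; // no applicable merge remains
        // Merge the winning pair into a single token and repeat.
        tokens.splice(bestIndex, 2, tokens[bestIndex] + tokens[bestIndex + 1]);
      }
      return tokens;
    }

With merges = new Map([["l l", 0], ["ll o", 1]]), bpeEncode("hello", merges) first merges "l" + "l" into "ll", then "ll" + "o" into "llo", returning ["h", "e", "llo"].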

Quick Start & Requirements

  • Install via npm: npm install llama-tokenizer-js
  • Import and use (a fuller token-budget sketch follows this list):
    import llamaTokenizer from 'llama-tokenizer-js';
    console.log(llamaTokenizer.encode("Hello world!").length);
    
  • Browser usage via <script> tag is also supported.
  • No specific hardware or OS requirements beyond standard JavaScript execution environments.
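
As referenced in the list above, here is a minimal token-budget check built on the documented encode() call. The 4096-token limit is an assumption reflecting LLaMA 2's context window, and the prompt string is made up; adjust both for your deployment.

    import llamaTokenizer from 'llama-tokenizer-js';

    // Assumed LLaMA 2 context window; change for your target model.
    const CONTEXT_LIMIT = 4096;

    const prompt = "Summarize the following article: ...";
    const tokenCount = llamaTokenizer.encode(prompt).length;

    if (tokenCount > CONTEXT_LIMIT) {
      console.warn(`Prompt is ${tokenCount} tokens, over the ${CONTEXT_LIMIT}-token limit.`);
    }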

Highlighted Details

  • First client-side LLaMA tokenizer for browsers.
  • Significantly lower latency (~1ms) compared to API calls to Python backends (~300ms).
  • Token counts can differ by up to 20% from OpenAI tokenizers.
  • Compatible with LLaMA models released by Facebook in March 2023 and July 2023.

Maintenance & Community

Developed by belladore.ai with contributions from xenova, blaze2004, imoneoi, and ConProgramming. An example demo/playground is available.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility is confirmed for LLaMA models based on Facebook's March 2023 and July 2023 checkpoints. OpenLLaMA models are noted as incompatible.

Limitations & Caveats

The library is designed for LLaMA 1 and 2; LLaMA 3 requires a separate repository. Training is not supported. Custom tokenizers require manual vocabulary and merge data swapping.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Simon Willison (Author of Django), Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 1 more.

GPT-3-Encoder by latitudegames
JS library for GPT-2/GPT-3 text tokenization
719 stars
created 4 years ago, updated 2 years ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 10 more.

open_llama by openlm-research
Open-source reproduction of LLaMA models
8k stars
created 2 years ago, updated 2 years ago