llama-tokenizer-js by belladoreai

JS tokenizer for LLaMA 1 and 2 models, client-side in browser

Created 2 years ago
359 stars

Top 77.9% on SourcePulse

View on GitHub
Project Summary

This library provides a JavaScript tokenizer specifically for LLaMA 1 and LLaMA 2 models, enabling accurate client-side token counting in web applications. It offers a lightweight, zero-dependency solution for developers needing to manage token limits or analyze text length within the browser or Node.js environments.

How It Works

The tokenizer implements a Byte-Pair Encoding (BPE) algorithm, optimized for both runtime performance and minimal bundle size. It bakes the necessary vocabulary and merge data into a single, base64-encoded JavaScript file, eliminating external dependencies and simplifying integration. This approach ensures efficient processing directly within the client's environment.
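The greedy merge loop at the heart of BPE is straightforward to sketch. The following toy example is illustrative only: the vocabulary and merge ranks are hypothetical, and the real library ships LLaMA's full data and a more optimized implementation that operates on bytes rather than characters:

    // Toy vocabulary and merge table, for illustration only.
    const vocab = new Map([
      ["H", 0], ["e", 1], ["l", 2], ["o", 3],
      ["He", 4], ["ll", 5], ["llo", 6], ["Hello", 7],
    ]);
    // pair -> merge rank; lower ranks merge first
    const merges = new Map([["H e", 0], ["l l", 1], ["ll o", 2], ["He llo", 3]]);

    function bpeEncode(text) {
      let tokens = [...text]; // start from single characters (real BPE starts from bytes)
      for (;;) {
        // Find the adjacent pair with the lowest merge rank.
        let bestRank = Infinity, bestIdx = -1;
        for (let i = 0; i < tokens.length - 1; i++) {
          const rank = merges.get(tokens[i] + " " + tokens[i + 1]);
          if (rank !== undefined && rank < bestRank) { bestRank = rank; bestIdx = i; }
        }
        if (bestIdx === -1) break; // no applicable merges remain
        tokens.splice(bestIdx, 2, tokens[bestIdx] + tokens[bestIdx + 1]);
      }
      return tokens.map(t => vocab.get(t)); // look up token ids
    }

    console.log(bpeEncode("Hello")); // [7]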

Quick Start & Requirements

  • Install via npm: npm install llama-tokenizer-js
  • Import and use (a token-budget sketch follows this list):
    import llamaTokenizer from 'llama-tokenizer-js';
    console.log(llamaTokenizer.encode("Hello world!").length);
  • Browser usage via <script> tag is also supported.
  • No specific hardware or OS requirements beyond standard JavaScript execution environments.
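
A common use case is checking a prompt against a model's context window before sending it. A minimal sketch, assuming a 4096-token budget (LLaMA 2's context length) and using only the encode() call shown above:

    import llamaTokenizer from 'llama-tokenizer-js';

    const MAX_TOKENS = 4096; // assumed app-level budget; adjust for your model
    const prompt = "Hello world!";
    const tokenCount = llamaTokenizer.encode(prompt).length;
    if (tokenCount > MAX_TOKENS) {
      console.warn(`Prompt is ${tokenCount} tokens; limit is ${MAX_TOKENS}.`);
    }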

Highlighted Details

  • The first LLaMA tokenizer to run client-side in the browser.
  • Far lower latency (~1ms) than round-tripping to a Python backend for tokenization (~300ms).
  • LLaMA token counts can differ by up to 20% from OpenAI tokenizer counts, so OpenAI tokenizers are not a reliable stand-in.
  • Compatible with LLaMA models based on the checkpoints Meta released in March 2023 (LLaMA 1) and July 2023 (LLaMA 2).

Maintenance & Community

Developed by belladore.ai with contributions from xenova, blaze2004, imoneoi, and ConProgramming. An example demo/playground is available.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility is confirmed for LLaMA models based on Meta's March 2023 (LLaMA 1) and July 2023 (LLaMA 2) checkpoints; OpenLLaMA models are noted as incompatible.

Limitations & Caveats

The library is designed for LLaMA 1 and 2; LLaMA 3 is covered by a separate repository. Training new tokenizers is not supported, and using a custom tokenizer requires manually swapping in its vocabulary and merge data.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp and comma.ai), and 20 more.

TinyLlama by jzhang38 (Top 0.1% on SourcePulse, 9k stars)

Tiny pretraining project for a 1.1B Llama model. Created 2 years ago; updated 1 year ago.