llama-tokenizer-js by belladoreai

JS tokenizer for LLaMA 1 and 2 models, client-side in browser

created 2 years ago
355 stars

Top 79.7% on sourcepulse

Project Summary

This library provides a JavaScript tokenizer specifically for LLaMA 1 and LLaMA 2 models, enabling accurate client-side token counting in web applications. It offers a lightweight, zero-dependency solution for developers who need to manage token limits or analyze text length in browser or Node.js environments.

How It Works

The tokenizer implements a Byte-Pair Encoding (BPE) algorithm, optimized for both runtime performance and minimal bundle size. It bakes the necessary vocabulary and merge data into a single, base64-encoded JavaScript file, eliminating external dependencies and simplifying integration. This approach ensures efficient processing directly within the client's environment.
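
To illustrate the core idea, here is a minimal sketch of the greedy merge loop that drives BPE. This is not the library's actual implementation (which works on bytes and uses merge ranks baked into its single data file); the bpeEncode function and the merges map are hypothetical, for illustration only.

    // Minimal greedy BPE merge loop (illustrative sketch, not the library's code).
    // `merges` maps an adjacent pair "left right" to its rank; lower rank merges first.
    function bpeEncode(text, merges) {
      let tokens = Array.from(text); // the real tokenizer starts from bytes, not characters
      while (true) {
        // Find the adjacent pair with the lowest (earliest) merge rank.
        let bestRank = Infinity, bestIndex = -1;
        for (let i = 0; i < tokens.length - 1; i++) {
          const rank = merges.get(tokens[i] + " " + tokens[i + 1]);
          if (rank !== undefined && rank < bestRank) {
            bestRank = rank;
            bestIndex = i;
          }
        }
        if (bestIndex === -1) break; // no applicable merge remains
        // Merge the winning pair into a single token and repeat.
        tokens.splice(bestIndex, 2, tokens[bestIndex] + tokens[bestIndex + 1]);
      }
      return tokens;
    }

With merges = new Map([["l l", 0], ["ll o", 1]]), bpeEncode("hello", merges) first merges "l" + "l" into "ll", then "ll" + "o" into "llo", returning ["h", "e", "llo"].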

Quick Start & Requirements

  • Install via npm: npm install llama-tokenizer-js
  • Import and use (a fuller token-budget sketch follows this list):
    import llamaTokenizer from 'llama-tokenizer-js';
    console.log(llamaTokenizer.encode("Hello world!").length);
    
  • Browser usage via <script> tag is also supported.
  • No specific hardware or OS requirements beyond standard JavaScript execution environments.
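
As referenced in the list above, here is a minimal token-budget check built on the documented encode() call. The 4096-token limit is an assumption reflecting LLaMA 2's context window, and the prompt string is made up; adjust both for your deployment.

    import llamaTokenizer from 'llama-tokenizer-js';

    // Assumed LLaMA 2 context window; change for your target model.
    const CONTEXT_LIMIT = 4096;

    const prompt = "Summarize the following article: ...";
    const tokenCount = llamaTokenizer.encode(prompt).length;

    if (tokenCount > CONTEXT_LIMIT) {
      console.warn(`Prompt is ${tokenCount} tokens, over the ${CONTEXT_LIMIT}-token limit.`);
    }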

Highlighted Details

  • First client-side LLaMA tokenizer for browsers.
  • Significantly lower latency (~1ms) compared to API calls to Python backends (~300ms).
  • Token counts can differ by up to 20% from OpenAI tokenizers.
  • Compatible with LLaMA models released by Facebook in March 2023 and July 2023.

Maintenance & Community

Developed by belladore.ai with contributions from xenova, blaze2004, imoneoi, and ConProgramming. An example demo/playground is available.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility is confirmed for LLaMA models based on Facebook's March 2023 and July 2023 checkpoints. OpenLLaMA models are noted as incompatible.

Limitations & Caveats

The library is designed for LLaMA 1 and 2; LLaMA 3 requires a separate repository. Training is not supported. Custom tokenizers require manual vocabulary and merge data swapping.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Simon Willison (Author of Django), Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 1 more.

GPT-3-Encoder by latitudegames
JS library for GPT-2/GPT-3 text tokenization
719 stars
created 4 years ago, updated 2 years ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 10 more.

open_llama by openlm-research
Open-source reproduction of LLaMA models
8k stars
created 2 years ago, updated 2 years ago