SharpToken by dmitry-brazhenko

Fast C# tokenization for LLM applications

Created 2 years ago

253 stars

Top 99.5% on SourcePulse

Project Summary

Summary

SharpToken is a C# library providing fast and accurate tokenization for natural language text, serving as a direct port of OpenAI's Python tiktoken library. It enables .NET developers to encode and decode text into tokens using various GPT-based encodings, crucial for tasks like LLM prompt sizing and RAG chunking. The library aims to offer a high-performance, low-allocation solution for .NET environments.

How It Works

This library implements the tokenization algorithms from OpenAI's tiktoken for .NET environments, supporting .NET 6, .NET 8, and .NET Standard 2.0. It provides functionality for encoding strings into token IDs and decoding token IDs back into strings, supporting multiple GPT encodings (e.g., cl100k_base, o200k_base). It also includes a high-performance CountTokens method for prompt size estimation. Model names can be mapped to their respective encodings via prefixes, simplifying integration.

Quick Start & Requirements

Installation: Use NuGet Package Manager (Install-Package SharpToken) or the .NET CLI (dotnet add package SharpToken).
Prerequisites: .NET 6, .NET 8, or .NET Standard 2.0 compatible frameworks.
Note: The project's functionality is being integrated into Microsoft.ML.Tokenizers, which is positioned as the central .NET tokenizer library. A stable release of Microsoft.ML.Tokenizers is expected alongside .NET 9.0 (November 2024). A migration guide is available: https://github.com/dotnet/machinelearning/blob/main/docs/code/microsoft-ml-tokenizers-migration-guide.md

Highlighted Details

Performance: Benchmarks indicate SharpToken offers high performance with low memory allocations, particularly on .NET 8.0, leveraging modern CPU instructions. It claims to be the fastest library with the lowest allocations compared to TiktokenSharp and TokenizerLib.
Encoding Support: Supports key GPT encodings like cl100k_base, p50k_base, p50k_edit, r50k_base, o200k_base, and o200k_harmony.
Model Prefix Matching: Provides GetEncodingForModel and GetEncodingNameForModel methods for flexible model-to-encoding resolution based on prefixes like gpt-4- or gpt-3.5-turbo-.
Custom Token Sets: Allows specifying custom sets of allowed or disallowed special tokens during encoding, enhancing control over tokenization.

Maintenance & Community

The project encourages contributions via issues and pull requests. However, the README explicitly states that development is actively shifting to Microsoft.ML.Tokenizers, which will become the central .NET tokenizer library. Users are encouraged to follow the migration path.

Licensing & Compatibility

The README does not explicitly state a license. Commercial use compatibility requires license clarification.

Limitations & Caveats

SharpToken is effectively in maintenance mode, with its core functionality being superseded by Microsoft.ML.Tokenizers. Users are advised to migrate to the latter for future improvements and stable releases, expected in late 2024, as Microsoft.ML.Tokenizers promises improved performance.

Health Check

Last Commit

5 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

3 stars in the last 30 days