Discover and explore top open-source AI tools and projects—updated daily.
dmitry-brazhenkoFast C# tokenization for LLM applications
Top 99.5% on SourcePulse
Summary
SharpToken is a C# library providing fast and accurate tokenization for natural language text, serving as a direct port of OpenAI's Python tiktoken library. It enables .NET developers to encode and decode text into tokens using various GPT-based encodings, crucial for tasks like LLM prompt sizing and RAG chunking. The library aims to offer a high-performance, low-allocation solution for .NET environments.
How It Works
This library implements the tokenization algorithms from OpenAI's tiktoken for .NET environments, supporting .NET 6, .NET 8, and .NET Standard 2.0. It provides functionality for encoding strings into token IDs and decoding token IDs back into strings, supporting multiple GPT encodings (e.g., cl100k_base, o200k_base). It also includes a high-performance CountTokens method for prompt size estimation. Model names can be mapped to their respective encodings via prefixes, simplifying integration.
Quick Start & Requirements
Install-Package SharpToken) or the .NET CLI (dotnet add package SharpToken).Microsoft.ML.Tokenizers, which is positioned as the central .NET tokenizer library. A stable release of Microsoft.ML.Tokenizers is expected alongside .NET 9.0 (November 2024). A migration guide is available: https://github.com/dotnet/machinelearning/blob/main/docs/code/microsoft-ml-tokenizers-migration-guide.mdHighlighted Details
TiktokenSharp and TokenizerLib.cl100k_base, p50k_base, p50k_edit, r50k_base, o200k_base, and o200k_harmony.GetEncodingForModel and GetEncodingNameForModel methods for flexible model-to-encoding resolution based on prefixes like gpt-4- or gpt-3.5-turbo-.Maintenance & Community
The project encourages contributions via issues and pull requests. However, the README explicitly states that development is actively shifting to Microsoft.ML.Tokenizers, which will become the central .NET tokenizer library. Users are encouraged to follow the migration path.
Licensing & Compatibility
The README does not explicitly state a license. Commercial use compatibility requires license clarification.
Limitations & Caveats
SharpToken is effectively in maintenance mode, with its core functionality being superseded by Microsoft.ML.Tokenizers. Users are advised to migrate to the latter for future improvements and stable releases, expected in late 2024, as Microsoft.ML.Tokenizers promises improved performance.
5 months ago
Inactive
xlang-ai
karpathy
guidance-ai