tokenizer  by sugarme

Go NLP tokenizers for Hugging Face models

created 5 years ago
255 stars

Top 99.2% on sourcepulse

Project Summary

This Go package provides pure Go implementations of NLP tokenizers, inspired by Hugging Face's Tokenizers library. It lets Go developers run the tokenization step for model training, testing, and inference directly within Go applications, which shortens the path from prototype to production software.

How It Works

The package is modular, with distinct sub-packages for the Normalizer, Pretokenizer, Tokenizer, and Post-processing stages. It supports three tokenization models: Word Level, WordPiece, and Byte Pair Encoding (BPE). Because the stages are separate, a tokenizer can be trained from scratch or an existing one can be fine-tuned, which gives flexibility across NLP tasks.
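The four stages described above can be sketched in a few dozen lines of plain Go. This is an illustrative sketch only: it does not use the sugarme/tokenizer API, and the function names (`normalize`, `preTokenize`, `wordPiece`, `encode`) and the tiny vocabulary are hypothetical. It assumes a BERT-style setup with lowercasing, greedy longest-match WordPiece, a `##` continuation prefix, and `[CLS]`/`[SEP]` added in post-processing.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize stands in for a Normalizer: lowercase and trim the input.
func normalize(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// preTokenize stands in for a Pretokenizer: split on whitespace and
// emit each punctuation rune as its own word.
func preTokenize(s string) []string {
	var words []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			words = append(words, string(cur))
			cur = cur[:0]
		}
	}
	for _, r := range s {
		switch {
		case unicode.IsSpace(r):
			flush()
		case unicode.IsPunct(r):
			flush()
			words = append(words, string(r))
		default:
			cur = append(cur, r)
		}
	}
	flush()
	return words
}

// wordPiece stands in for the model stage: greedy longest-match-first
// lookup in vocab. Non-initial pieces carry the "##" prefix. A word that
// cannot be fully covered by vocabulary pieces becomes unk.
func wordPiece(word string, vocab map[string]bool, unk string) []string {
	runes := []rune(word)
	var pieces []string
	for start := 0; start < len(runes); {
		end := len(runes)
		match := ""
		for ; end > start; end-- {
			cand := string(runes[start:end])
			if start > 0 {
				cand = "##" + cand
			}
			if vocab[cand] {
				match = cand
				break
			}
		}
		if match == "" {
			return []string{unk}
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

// encode chains the stages and applies a BERT-style post-processing step
// that wraps the sequence in [CLS] ... [SEP].
func encode(text string, vocab map[string]bool) []string {
	tokens := []string{"[CLS]"}
	for _, w := range preTokenize(normalize(text)) {
		tokens = append(tokens, wordPiece(w, vocab, "[UNK]")...)
	}
	return append(tokens, "[SEP]")
}

func main() {
	// Hypothetical toy vocabulary, for illustration only.
	vocab := map[string]bool{
		"the": true, "go": true, "##pher": true, "##s": true, "code": true, ".": true,
	}
	fmt.Printf("%q\n", encode("The Gophers code.", vocab))
	// prints ["[CLS]" "the" "go" "##pher" "##s" "code" "." "[SEP]"]
}
```

Keeping each stage as a separate function mirrors the modular layout: swapping the model stage for Word Level or BPE would leave normalization, pre-tokenization, and post-processing untouched.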

Quick Start & Requirements

  • Primary install: go get github.com/sugarme/tokenizer
  • Prerequisites: Go toolchain.
  • Example usage: the README demonstrates loading pretrained Hugging Face tokenizers (e.g., bert-base-uncased) via the pretrained subpackage.
  • Detailed APIs are available on pkg.go.dev.

Highlighted Details

  • Implements Word Level, WordPiece, and BPE tokenization models.
  • Can load pretrained tokenizers from Hugging Face.
  • Modular design with distinct Normalizer, Pretokenizer, Tokenizer, and Post-processing components.
  • Enables training new models or fine-tuning existing ones.
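To make "training a new model" concrete for the BPE case, here is a minimal sketch of the classic merge-learning loop: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. It is not the package's trainer; `pair`, `mergePair`, `learnBPE`, the lexicographic tie-breaking rule, and the toy word counts are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// pair is an adjacent pair of symbols.
type pair struct{ a, b string }

// mergePair replaces each adjacent occurrence of p with the joined symbol.
func mergePair(symbols []string, p pair) []string {
	out := make([]string, 0, len(symbols))
	for i := 0; i < len(symbols); i++ {
		if i+1 < len(symbols) && symbols[i] == p.a && symbols[i+1] == p.b {
			out = append(out, p.a+p.b)
			i++ // skip the second half of the merged pair
		} else {
			out = append(out, symbols[i])
		}
	}
	return out
}

// learnBPE learns up to numMerges merge rules from word frequencies.
func learnBPE(wordFreq map[string]int, numMerges int) []pair {
	// Start with every word split into single characters.
	corpus := map[string][]string{}
	for w := range wordFreq {
		corpus[w] = strings.Split(w, "")
	}
	var merges []pair
	for n := 0; n < numMerges; n++ {
		// Count adjacent pairs, weighted by how often each word occurs.
		counts := map[pair]int{}
		for w, syms := range corpus {
			for i := 0; i+1 < len(syms); i++ {
				counts[pair{syms[i], syms[i+1]}] += wordFreq[w]
			}
		}
		// Pick the most frequent pair; break ties lexicographically so the
		// result does not depend on Go's random map iteration order.
		best, bestCount := pair{}, 0
		for p, c := range counts {
			less := p.a < best.a || (p.a == best.a && p.b < best.b)
			if c > bestCount || (c == bestCount && less) {
				best, bestCount = p, c
			}
		}
		if bestCount == 0 {
			break // nothing left to merge
		}
		merges = append(merges, best)
		for w, syms := range corpus {
			corpus[w] = mergePair(syms, best)
		}
	}
	return merges
}

func main() {
	// Hypothetical word counts, for illustration only.
	wordFreq := map[string]int{"low": 5, "lower": 2, "newest": 6, "widest": 3}
	fmt.Println(learnBPE(wordFreq, 3))
	// prints [{e s} {es t} {l o}]
}
```

The learned merge list is the trained model: applying the merges in order to a new word reproduces the same segmentation at inference time.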

Maintenance & Community

No specific community channels, roadmap, or contributor information is detailed in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial and closed-source applications.

Limitations & Caveats

The README does not provide performance benchmarks, does not describe model compatibility beyond Hugging Face, and gives no information on community support or a project roadmap.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 22 stars in the last 90 days
