tokenizer by sugarme

Go NLP tokenizers for Hugging Face models

Created 5 years ago
269 stars

Top 95.5% on SourcePulse

Project Summary

This package provides pure Go implementations of NLP tokenizers, modeled on Hugging Face's Tokenizers library. It lets Go developers run the tokenization step for model training, testing, and inference directly inside Go applications, so a production service can stay in one language.

How It Works

The tokenizer is modular, with distinct sub-packages for Normalizer, Pretokenizer, Tokenizer, and Post-processing. It supports three tokenization models: Word Level, WordPiece, and Byte Pair Encoding (BPE). Models can be trained from scratch or fine-tuned from existing ones, and the stages can be combined to suit different NLP tasks.

Quick Start & Requirements

  • Primary install: go get github.com/sugarme/tokenizer
  • Prerequisites: Go toolchain.
  • Usage: the README example loads a pretrained Hugging Face tokenizer (e.g., bert-base-uncased) via the pretrained subpackage.
  • API reference: available on pkg.go.dev.

Highlighted Details

  • Implements Word Level, WordPiece, and BPE tokenization models.
  • Loads pretrained tokenizers published on Hugging Face.
  • Modular design with distinct Normalizer, Pretokenizer, Tokenizer, and Post-processing components.
  • Enables training new models or fine-tuning existing ones.

Maintenance & Community

No specific community channels, roadmap, or contributor information is detailed in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial and closed-source applications.

Limitations & Caveats

The README provides no performance benchmarks, does not specify model compatibility beyond Hugging Face, and gives no information on community support or a project roadmap.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 2
  • Issues (30d): 4
  • Star history: 9 stars in the last 30 days
