byt5 by google-research

Byte-to-byte model research paper

created 4 years ago
514 stars

Top 61.7% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

ByT5 offers a token-free approach to large language models, operating directly on UTF-8 bytes instead of subword units. This simplifies preprocessing and yields performance competitive with mT5, with particular gains on noisy text and on tasks sensitive to spelling and pronunciation. It is targeted at researchers and practitioners working with diverse text data or seeking to reduce tokenization overhead.

How It Works

ByT5 extends the mT5 architecture by replacing the standard text tokenizer with a byte-level processing pipeline. This allows the model to handle any UTF-8 encoded text without requiring a predefined vocabulary or complex tokenization algorithms. The byte-level representation is argued to be more robust to variations in text, such as misspellings or informal language.
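
As a rough illustration of the input pipeline, the sketch below shows a byte-level encoder in the spirit of ByT5: text is UTF-8 encoded and each byte becomes a token ID, with no learned vocabulary. The offset of 3 for the PAD/EOS/UNK special IDs is an assumption borrowed from common ByT5 tokenizer implementations, not something stated in this README.

```python
# Minimal sketch of ByT5-style byte-level "tokenization" (no learned vocabulary).
# The ID offsets below are an assumption following the common convention of
# reserving IDs 0-2 for PAD/EOS/UNK; verify against the released code.

SPECIAL_TOKENS = {"<pad>": 0, "</s>": 1, "<unk>": 2}
OFFSET = len(SPECIAL_TOKENS)  # byte value b maps to token ID b + OFFSET

def encode(text: str) -> list[int]:
    """Map a string to byte-level token IDs, appending an end-of-sequence ID."""
    return [b + OFFSET for b in text.encode("utf-8")] + [SPECIAL_TOKENS["</s>"]]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping special-token IDs."""
    byte_values = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return byte_values.decode("utf-8", errors="ignore")

print(encode("héllo"))          # accented characters simply become multi-byte UTF-8 sequences
print(decode(encode("héllo")))  # -> "héllo"
```

Because any UTF-8 string maps to bytes this way, there is no out-of-vocabulary problem; the trade-off is that sequences are longer than their subword equivalents.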

Quick Start & Requirements

  • Installation: Requires the t5 library.
  • Prerequisites: Primarily designed for TPUs (e.g., v3-256) for training and fine-tuning, with specific GCP project, zone, and bucket configurations. The README does not cover GPU setups; CUDA would only come into play if you adapt the scripts to run off-TPU.
  • Usage: Training and fine-tuning examples are provided using t5.models.mesh_transformer_main with specific Gin configuration files and flags; an illustrative invocation is sketched after this list.
  • Resources: Training from scratch requires significant TPU resources (e.g., v3-256 for 1M steps). Fine-tuning examples also utilize large TPUs.
  • Documentation: General instructions are in the t5 repository; ByT5-specific tasks require --module_import="byt5.tasks".
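
For orientation, here is a minimal sketch of the kind of launch the README builds on, wrapped in Python for readability. The TPU name, GCP project, zone, bucket path, and Gin file are hypothetical placeholders; the flag names follow the t5 library's t5_mesh_transformer entry point (which wraps t5.models.mesh_transformer_main). Consult the repository's own commands for the exact Gin files and parameters.

```python
# Sketch of launching a ByT5 fine-tuning job via the t5 CLI the README points to.
# All resource names (TPU, GCP project, zone, bucket, Gin file) are placeholders.
import subprocess

MODEL_DIR = "gs://your-bucket/byt5/finetune"  # placeholder output location

cmd = [
    "t5_mesh_transformer",                     # console script for t5.models.mesh_transformer_main
    "--tpu=your-tpu-name",                     # placeholder TPU (e.g., a v3-256 slice)
    "--gcp_project=your-project",
    "--tpu_zone=your-zone",
    f"--model_dir={MODEL_DIR}",
    "--module_import=byt5.tasks",              # registers the ByT5-specific tasks
    "--gin_file=path/to/finetune_config.gin",  # placeholder; use the Gin files from this repo
]

subprocess.run(cmd, check=True)
```

The key ByT5-specific piece is --module_import="byt5.tasks"; everything else follows the general t5 repository instructions.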

Highlighted Details

  • Operates directly on UTF-8 bytes, eliminating the need for tokenizers.
  • Parameter-matched ByT5 models are competitive with mT5.
  • Outperforms mT5 on noisy text and spelling-sensitive tasks.
  • Released checkpoints range from 300M to 13B parameters.

Maintenance & Community

This project is part of Google Research. It is noted as "not an officially supported Google product." No community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Given its origin from Google Research and the nature of similar projects, it is likely to be Apache 2.0 or a similar permissive license, but this requires verification. Compatibility for commercial use would depend on the specific license.

Limitations & Caveats

The provided examples and training scripts are heavily geared towards Google's TPU infrastructure and GCP, potentially making setup on other platforms complex. The README does not detail specific limitations regarding sequence length handling or potential performance differences on non-English or highly structured data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

9 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (Author of SGLang) and Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX).

xgen by salesforce

0%
720
LLM research release with 8k sequence length
created 2 years ago
updated 6 months ago
Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

LWM by LargeWorldModel

0.0%
7k
Multimodal autoregressive model for long-context video/text
created 1 year ago
updated 9 months ago