byt5 by google-research

Byte-to-byte model research paper

created 4 years ago
514 stars

Top 61.7% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

ByT5 offers a token-free approach to large language models, operating directly on UTF-8 bytes instead of subword units. This simplifies preprocessing and yields performance competitive with mT5, with particular gains on noisy text and on tasks sensitive to spelling and pronunciation. It is targeted at researchers and practitioners working with diverse text data or seeking to reduce tokenization overhead.

How It Works

ByT5 extends the mT5 architecture by replacing the standard text tokenizer with a byte-level processing pipeline. This allows the model to handle any UTF-8 encoded text without requiring a predefined vocabulary or complex tokenization algorithms. The byte-level representation is argued to be more robust to variations in text, such as misspellings or informal language.
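
As a rough illustration of the input pipeline, the sketch below shows a byte-level encoder in the spirit of ByT5: text is UTF-8 encoded and each byte becomes a token ID, with no learned vocabulary. The offset of 3 for the PAD/EOS/UNK special IDs is an assumption borrowed from common ByT5 tokenizer implementations, not something stated in this README.

```python
# Minimal sketch of ByT5-style byte-level "tokenization" (no learned vocabulary).
# The ID offsets below are an assumption following the common convention of
# reserving IDs 0-2 for PAD/EOS/UNK; verify against the released code.

SPECIAL_TOKENS = {"<pad>": 0, "</s>": 1, "<unk>": 2}
OFFSET = len(SPECIAL_TOKENS)  # byte value b maps to token ID b + OFFSET

def encode(text: str) -> list[int]:
    """Map a string to byte-level token IDs, appending an end-of-sequence ID."""
    return [b + OFFSET for b in text.encode("utf-8")] + [SPECIAL_TOKENS["</s>"]]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping special-token IDs."""
    byte_values = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return byte_values.decode("utf-8", errors="ignore")

print(encode("héllo"))          # accented characters simply become multi-byte UTF-8 sequences
print(decode(encode("héllo")))  # -> "héllo"
```

Because any UTF-8 string maps to bytes this way, there is no out-of-vocabulary problem; the trade-off is that sequences are longer than their subword equivalents.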

Quick Start & Requirements

  • Installation: Requires the t5 library.
  • Prerequisites: Primarily designed for TPUs (e.g., v3-256) for training and fine-tuning, with specific GCP project, zone, and bucket configurations. The README does not cover GPU setups; CUDA would only come into play if you adapt the scripts to run off-TPU.
  • Usage: Training and fine-tuning examples are provided using t5.models.mesh_transformer_main with specific Gin configuration files and flags; an illustrative invocation is sketched after this list.
  • Resources: Training from scratch requires significant TPU resources (e.g., v3-256 for 1M steps). Fine-tuning examples also utilize large TPUs.
  • Documentation: General instructions are in the t5 repository; ByT5-specific tasks require --module_import="byt5.tasks".
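
For orientation, here is a minimal sketch of the kind of launch the README builds on, wrapped in Python for readability. The TPU name, GCP project, zone, bucket path, and Gin file are hypothetical placeholders; the flag names follow the t5 library's t5_mesh_transformer entry point (which wraps t5.models.mesh_transformer_main). Consult the repository's own commands for the exact Gin files and parameters.

```python
# Sketch of launching a ByT5 fine-tuning job via the t5 CLI the README points to.
# All resource names (TPU, GCP project, zone, bucket, Gin file) are placeholders.
import subprocess

MODEL_DIR = "gs://your-bucket/byt5/finetune"  # placeholder output location

cmd = [
    "t5_mesh_transformer",                     # console script for t5.models.mesh_transformer_main
    "--tpu=your-tpu-name",                     # placeholder TPU (e.g., a v3-256 slice)
    "--gcp_project=your-project",
    "--tpu_zone=your-zone",
    f"--model_dir={MODEL_DIR}",
    "--module_import=byt5.tasks",              # registers the ByT5-specific tasks
    "--gin_file=path/to/finetune_config.gin",  # placeholder; use the Gin files from this repo
]

subprocess.run(cmd, check=True)
```

The key ByT5-specific piece is --module_import="byt5.tasks"; everything else follows the general t5 repository instructions.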

Highlighted Details

  • Operates directly on UTF-8 bytes, eliminating the need for tokenizers.
  • Parameter-matched ByT5 models are competitive with mT5.
  • Outperforms mT5 on noisy text and spelling-sensitive tasks.
  • Released checkpoints range from 300M to 13B parameters.

Maintenance & Community

This project is part of Google Research. It is noted as "not an officially supported Google product." No community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Given its origin from Google Research and the nature of similar projects, it is likely to be Apache 2.0 or a similar permissive license, but this requires verification. Compatibility for commercial use would depend on the specific license.

Limitations & Caveats

The provided examples and training scripts are heavily geared towards Google's TPU infrastructure and GCP, potentially making setup on other platforms complex. The README does not detail specific limitations regarding sequence length handling or potential performance differences on non-English or highly structured data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

9 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (Author of SGLang) and Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX).

xgen by salesforce

0%
720
LLM research release with 8k sequence length
created 2 years ago
updated 6 months ago
Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

LWM by LargeWorldModel

0.0%
7k
Multimodal autoregressive model for long-context video/text
created 1 year ago
updated 9 months ago