seqio by google

SeqIO: Task-based datasets, preprocessing, and evaluation library for sequence models

Created 4 years ago

593 stars

Top 54.9% on SourcePulse

5 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Jinze Bai

Research Scientist at Alibaba Qwen

Edward Sun

Research Scientist at Meta Superintelligence Lab

Gabriel Almeida

Cofounder of Langflow

and 1 more!

Project Summary

SeqIO is a library for processing sequential data to be fed into downstream sequence models. It targets researchers and engineers working on large-scale NLP and sequence modeling tasks. It provides a framework for defining datasets, applying preprocessing steps, tokenizing data, and evaluating model performance, with the aim of simplifying and standardizing data pipelines.

How It Works

SeqIO abstracts data processing into Task objects, which encapsulate a data source (e.g., TFDS, text files), preprocessing functions, tokenization vocabularies, and evaluation metrics. It builds its data pipelines on tf.data.Dataset but otherwise requires little direct use of TensorFlow: the resulting dataset can be converted to a NumPy iterator, so it can feed other frameworks such as JAX and PyTorch. Users define Mixture objects to combine multiple tasks with specified sampling rates.
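To make the idea of mixing tasks at specified sampling rates concrete, here is a minimal pure-Python sketch. It does not use SeqIO's Mixture class; the function name `sample_mixture`, the task names, and the example strings are all hypothetical, and each task is assumed to be an infinite iterator.

```python
import itertools
import random
from typing import Dict, Iterator, Tuple


def sample_mixture(
    task_examples: Dict[str, Iterator[str]],
    rates: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (task_name, example) pairs, picking a task in proportion to its rate.

    Illustration only: each iterator in `task_examples` is assumed to be infinite.
    """
    rng = random.Random(seed)
    names = sorted(task_examples)  # fixed order keeps sampling reproducible
    weights = [rates[name] for name in names]
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, next(task_examples[name])


# Two hypothetical tasks that repeat forever, mixed at a 3:1 ratio.
tasks = {
    "translate": itertools.cycle(["en->de: hello", "en->de: thanks"]),
    "summarize": itertools.cycle(["summarize: a long article"]),
}
mixed = sample_mixture(tasks, rates={"translate": 3.0, "summarize": 1.0}, seed=42)
first_batch = list(itertools.islice(mixed, 1000))
translate_share = sum(name == "translate" for name, _ in first_batch) / len(first_batch)
print(f"translate share: {translate_share:.2f}")
```

Rates are relative weights rather than probabilities, so 3.0 and 1.0 should give roughly three "translate" examples for every "summarize" example over a long run.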

Quick Start & Requirements

Install via pip: pip install seqio
Requires TensorFlow for tf.data.Dataset operations.
Official Documentation: https://seqio.readthedocs.io/en/latest/
Usage Tutorial: https://github.com/google/seqio#usage

Highlighted Details

Supports various model architectures (encoder-decoder, decoder-only, encoder-only) through FeatureConverter classes.
Enables fine-grained control over offline caching stages for performance optimization.
Offers flexible evaluation through the Evaluator class, with metrics such as BLEU and exact match.
Includes preprocessors for common tasks like tokenization, appending EOS, and custom transformations.
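As a rough illustration of what an encoder-decoder feature converter produces, the sketch below appends EOS to tokenized inputs and targets, then builds padded model features. It is plain Python rather than SeqIO's FeatureConverter API: the helper names, `PAD_ID = 0`, and `EOS_ID = 1` are assumptions made for this example, and the output keys are chosen to resemble an encoder-decoder layout.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding id (also used as the decoder start token here)
EOS_ID = 1  # assumed end-of-sequence id


def append_eos_after_trim(ids: List[int], max_len: int) -> List[int]:
    """Trim so that EOS still fits within max_len, then append EOS."""
    return ids[: max_len - 1] + [EOS_ID]


def pad_to(values: List[int], length: int, fill: int = PAD_ID) -> List[int]:
    """Right-pad a list with `fill` up to `length`."""
    return values + [fill] * (length - len(values))


def to_encoder_decoder_features(
    inputs: List[int], targets: List[int], lengths: Dict[str, int]
) -> Dict[str, List[int]]:
    """Convert tokenized inputs/targets into padded encoder-decoder features."""
    enc = append_eos_after_trim(inputs, lengths["inputs"])
    dec_targets = append_eos_after_trim(targets, lengths["targets"])
    # Decoder inputs are the targets shifted right by one position.
    dec_inputs = [PAD_ID] + dec_targets[:-1]
    # Loss is counted only on real (non-padding) target positions.
    loss_weights = [1] * len(dec_targets)
    return {
        "encoder_input_tokens": pad_to(enc, lengths["inputs"]),
        "decoder_target_tokens": pad_to(dec_targets, lengths["targets"]),
        "decoder_input_tokens": pad_to(dec_inputs, lengths["targets"]),
        "decoder_loss_weights": pad_to(loss_weights, lengths["targets"], fill=0),
    }


features = to_encoder_decoder_features(
    inputs=[7, 8, 9], targets=[4, 5], lengths={"inputs": 5, "targets": 4}
)
for name, values in features.items():
    print(name, values)
```

A decoder-only or encoder-only converter would consume the same tokenized inputs and targets but emit a different set of features, which is why the conversion is kept separate from the Task definition.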

Maintenance & Community

Developed by Google.
Described alongside T5X in the paper "Scaling Up Models and Data with t5x and seqio", which suggests ongoing use within Google's ML ecosystem.
Community interaction points are not explicitly detailed in the README.

Licensing & Compatibility

Apache License 2.0.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Although SeqIO is designed to need little direct TensorFlow code, its core data pipeline operations still depend on tf.data.Dataset, so TensorFlow must be installed.
Caching stochastic SeqIO Mixtures is not supported.
Preprocessing steps requiring sequence_length must occur after the CacheDatasetPlaceholder.
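The ordering constraint on sequence_length can be illustrated with a toy pipeline. The sketch below is not SeqIO's implementation: `CACHE_PLACEHOLDER`, `split_at_placeholder`, `run`, and the two toy preprocessors are hypothetical names. The point it shows is that everything before the placeholder is computed once and stored, so it cannot depend on a sequence length that is only known when the cache is read.

```python
import inspect
from typing import Any, Callable, Dict, List, Tuple

CACHE_PLACEHOLDER = object()  # hypothetical marker: steps before it get cached


def needs_sequence_length(step: Callable[..., Any]) -> bool:
    """A step depends on sequence length if it takes a `sequence_length` argument."""
    return "sequence_length" in inspect.signature(step).parameters


def split_at_placeholder(steps: List[Any]) -> Tuple[List[Any], List[Any]]:
    """Split steps into (cached, post-cache) and reject misplaced steps."""
    idx = steps.index(CACHE_PLACEHOLDER)
    pre, post = steps[:idx], steps[idx + 1:]
    for step in pre:
        if needs_sequence_length(step):
            raise ValueError(
                f"{step.__name__} needs sequence_length and must come after the placeholder"
            )
    return pre, post


def run(steps: List[Any], example: Dict[str, Any], sequence_length: Dict[str, int]):
    """Apply steps in order, passing sequence_length only to steps that ask for it."""
    for step in steps:
        if needs_sequence_length(step):
            example = step(example, sequence_length=sequence_length)
        else:
            example = step(example)
    return example


def tokenize(example: Dict[str, str]) -> Dict[str, List[int]]:
    """Toy tokenizer: one id per character."""
    return {key: [ord(ch) for ch in text] for key, text in example.items()}


def trim(example: Dict[str, List[int]], sequence_length: Dict[str, int]):
    """Cut each feature down to its requested length."""
    return {key: ids[: sequence_length[key]] for key, ids in example.items()}


pre, post = split_at_placeholder([tokenize, CACHE_PLACEHOLDER, trim])
# Computed once, independent of any sequence length.
cached = run(pre, {"inputs": "hello", "targets": "hi"}, sequence_length={})
# The same cached example can then be read back with different lengths.
short = run(post, cached, {"inputs": 3, "targets": 1})
full = run(post, cached, {"inputs": 5, "targets": 2})
print(short)
print(full)
```

Placing `trim` before the placeholder would freeze one particular length into the cache, which is what the check in `split_at_placeholder` rejects.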

Health Check

Last Commit

2 days ago

Responsiveness

1 week

Star History

2 stars in the last 30 days