seqio by google

SeqIO: Task-based datasets, preprocessing, and evaluation library for sequence models

Created 4 years ago · 583 stars · Top 56.4% on sourcepulse

Project Summary

SeqIO is a library for processing sequential data for downstream sequence models, targeting researchers and engineers working with large-scale NLP and sequence modeling tasks. It provides a flexible framework for defining datasets, applying preprocessing steps, tokenizing data, and evaluating model performance, aiming to simplify and standardize data pipelines.

How It Works

SeqIO abstracts data processing into Task objects, which encapsulate a data source (e.g., TFDS datasets or text files), preprocessing functions, tokenization vocabularies, and evaluation metrics. Pipelines are built on tf.data.Dataset, but the model itself does not have to be TensorFlow-based: the resulting datasets can be consumed from frameworks such as JAX and PyTorch through NumPy iterators. Multiple Tasks can be combined into Mixture objects with specified sampling rates.
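
A minimal registration sketch, modeled on the pattern in the SeqIO README; the task name, mixture name, TFDS dataset, vocabulary path, and metric function below are illustrative placeholders:

    import seqio

    VOCAB = seqio.SentencePieceVocabulary("/path/to/spm.model")  # placeholder path

    @seqio.map_over_dataset
    def to_inputs_and_targets(ex):
        # Map raw TFDS fields to the "inputs"/"targets" convention (the "en"/"de"
        # keys assume a TFDS translation dataset).
        return {"inputs": ex["en"], "targets": ex["de"]}

    def exact_match(targets, predictions):
        # Toy predict-metric: fraction of exact string matches.
        matches = sum(t == p for t, p in zip(targets, predictions))
        return {"exact_match": matches / max(len(targets), 1)}

    # A Task bundles a data source, preprocessors, vocabularies, and metrics.
    seqio.TaskRegistry.add(
        "my_translation_task",
        source=seqio.TfdsDataSource(tfds_name="wmt19_translate/de-en:1.0.0"),
        preprocessors=[
            to_inputs_and_targets,
            seqio.preprocessors.tokenize,
            seqio.CacheDatasetPlaceholder(),
            seqio.preprocessors.append_eos,
        ],
        output_features={
            "inputs": seqio.Feature(vocabulary=VOCAB, add_eos=True),
            "targets": seqio.Feature(vocabulary=VOCAB, add_eos=True),
        },
        metric_fns=[exact_match],
    )

    # A Mixture combines registered Tasks with sampling rates.
    seqio.MixtureRegistry.add("my_mixture", [("my_translation_task", 1.0)])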

Quick Start & Requirements
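
A sketch of the typical flow, assuming the package is installed from PyPI (pip install seqio) and that a Task named "my_translation_task" has been registered as in the example above:

    import seqio

    # Request a preprocessed, tokenized dataset from a registered Task or Mixture.
    dataset = seqio.get_mixture_or_task("my_translation_task").get_dataset(
        sequence_length={"inputs": 256, "targets": 128},
        split="train",
        shuffle=True,
        num_epochs=1,
        seed=42,
    )

    # The result is a tf.data.Dataset; non-TensorFlow frameworks can consume it
    # through a NumPy iterator.
    for example in dataset.as_numpy_iterator():
        print(example["inputs"], example["targets"])
        break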

Highlighted Details

  • Supports encoder-decoder, decoder-only, and encoder-only architectures through FeatureConverter classes (see the sketch after this list).
  • Enables fine-grained control over offline caching stages for performance optimization.
  • Offers flexible evaluation via the Evaluator class, with metrics such as BLEU and exact match.
  • Includes preprocessors for common tasks like tokenization, appending EOS, and custom transformations.
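
A sketch of the FeatureConverter and Evaluator usage referenced above, following the README's encoder-decoder pattern; the task name is the placeholder used earlier, and the predict/score functions would be supplied by the training loop:

    import seqio

    # Convert task features ("inputs"/"targets") into model features for an
    # encoder-decoder architecture, with example packing enabled.
    task_feature_lengths = {"inputs": 256, "targets": 128}
    ds = seqio.get_mixture_or_task("my_translation_task").get_dataset(
        sequence_length=task_feature_lengths, split="train")
    model_ds = seqio.EncDecFeatureConverter(pack=True)(
        ds, task_feature_lengths=task_feature_lengths)

    # Build an Evaluator over the validation split; the metrics come from the
    # Task's metric_fns.
    evaluator = seqio.Evaluator(
        mixture_or_task_name="my_translation_task",
        feature_converter=seqio.EncDecFeatureConverter(pack=False),
        eval_split="validation",
    )
    # Inside a training loop, evaluator.evaluate(...) is called with the model's
    # predict/score functions to compute the registered metrics.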

Maintenance & Community

  • Developed by Google.
  • Cited in the T5X paper, indicating active use and development within Google's ML ecosystem.
  • Community interaction points are not explicitly detailed in the README.

Licensing & Compatibility

  • Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • While designed for minimal TensorFlow use, core data pipeline operations rely on tf.data.Dataset.
  • Caching stochastic SeqIO Mixtures is not supported.
  • Preprocessing steps that require sequence_length must be placed after the CacheDatasetPlaceholder (see the ordering sketch below).
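
A minimal illustration of the ordering constraint above (the preprocessor functions are from seqio.preprocessors; the surrounding Task definition is omitted):

    import seqio

    # Steps before CacheDatasetPlaceholder are deterministic and can be cached
    # offline; steps that need sequence_length must come after it and run at
    # dataset-loading time.
    preprocessors = [
        seqio.preprocessors.tokenize,               # length-independent, cacheable
        seqio.CacheDatasetPlaceholder(),            # cache boundary
        seqio.preprocessors.append_eos_after_trim,  # uses sequence_length, so it follows the placeholder
    ]
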
Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 10 stars in the last 90 days
