seqio by google

SeqIO: Task-based datasets, preprocessing, and evaluation library for sequence models

Created 4 years ago
586 stars

Top 55.4% on SourcePulse

Project Summary

SeqIO is a library for processing sequential data for downstream sequence models, targeting researchers and engineers working with large-scale NLP and sequence modeling tasks. It provides a flexible framework for defining datasets, applying preprocessing steps, tokenizing data, and evaluating model performance, aiming to simplify and standardize data pipelines.

How It Works

SeqIO abstracts data processing into Task objects, each of which bundles a data source (e.g., TFDS or text files), preprocessing functions, tokenization vocabularies, and evaluation metrics. Pipelines are built on tf.data.Dataset but require little direct use of TensorFlow: the resulting dataset can be converted to a NumPy iterator, which makes it usable from other frameworks such as JAX and PyTorch. Mixture objects combine multiple tasks, each with a specified sampling rate.
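
The idea of combining tasks by sampling rate can be illustrated with a small standard-library sketch. This is not SeqIO's API: the `mix` and `repeat` helpers, the task names, and the rates are all hypothetical, and SeqIO itself performs the mixing on tf.data.Dataset objects rather than Python iterators.

```python
import random
from collections import Counter
from typing import Dict, Iterator, Tuple


def mix(sources: Dict[str, Iterator[str]],
        rates: Dict[str, float],
        seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield (task_name, example) pairs, picking a task in proportion to its rate."""
    rng = random.Random(seed)
    names = list(sources)
    weights = [rates[name] for name in names]
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            yield name, next(sources[name])
        except StopIteration:
            return  # simplification: stop as soon as any source is exhausted


def repeat(prefix: str) -> Iterator[str]:
    """Hypothetical endless data source standing in for one task's dataset."""
    i = 0
    while True:
        yield f"{prefix}-{i}"
        i += 1


# Two hypothetical tasks; "summarize" is sampled three times as often as "translate".
sources = {"translate": repeat("tr"), "summarize": repeat("su")}
rates = {"translate": 1.0, "summarize": 3.0}

mixed = mix(sources, rates, seed=0)
draws = [next(mixed) for _ in range(4000)]
counts = Counter(name for name, _example in draws)
print(counts)
```

Rates are relative weights, so only their ratio matters: rates of 1 and 3 give the second task roughly three quarters of the examples over a long enough run.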

Quick Start & Requirements

Highlighted Details

  • Supports various model architectures (encoder-decoder, decoder-only, encoder-only) through FeatureConverter classes.
  • Enables fine-grained control over offline caching stages for performance optimization.
  • Offers flexible evaluation through the Evaluator class, with metrics such as BLEU and exact match.
  • Includes preprocessors for common tasks like tokenization, appending EOS, and custom transformations.
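
To make the preprocessing and evaluation bullets concrete, here is a rough sketch of a preprocessor chain (tokenize, then append EOS) followed by an exact-match metric. It is written in plain Python and does not call SeqIO: the toy vocabulary, the reserved ids (0 for padding, 1 for EOS), and every function name are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Example = Dict[str, Any]
Preprocessor = Callable[[Example], Example]

# Hypothetical toy vocabulary; ids 0 (padding) and 1 (EOS) are reserved by assumption.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 5
VOCAB = {"hello": 2, "world": 3, "goodbye": 4}


def tokenize(example: Example) -> Example:
    """Map each whitespace-separated word of every feature to an integer id."""
    return {key: [VOCAB.get(word, UNK_ID) for word in str(text).lower().split()]
            for key, text in example.items()}


def append_eos(example: Example) -> Example:
    """Append the end-of-sequence id to every tokenized feature."""
    return {key: list(ids) + [EOS_ID] for key, ids in example.items()}


def run_pipeline(example: Example, preprocessors: List[Preprocessor]) -> Example:
    """Apply the preprocessors in order, each consuming the previous output."""
    for fn in preprocessors:
        example = fn(example)
    return example


def exact_match(targets: List[str], predictions: List[str]) -> Dict[str, float]:
    """Percentage of predictions that equal their target string exactly."""
    hits = sum(t == p for t, p in zip(targets, predictions))
    return {"exact_match": 100.0 * hits / len(targets)}


features = run_pipeline(
    {"inputs": "Hello world", "targets": "goodbye world"},
    [tokenize, append_eos],
)
score = exact_match(["goodbye world", "hello"], ["goodbye world", "hi"])
print(features, score)
```

Keeping each step a function from example to example is what lets the steps be reordered or swapped independently; the metric takes decoded strings, so it is decoupled from the vocabulary.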

Maintenance & Community

  • Developed by Google.
  • Cited in the T5X paper, which suggests ongoing use within Google's ML ecosystem.
  • Community interaction points are not explicitly detailed in the README.

Licensing & Compatibility

  • Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • While designed for minimal TensorFlow use, core data pipeline operations rely on tf.data.Dataset.
  • Caching stochastic SeqIO Mixtures is not supported.
  • Preprocessing steps requiring sequence_length must occur after the CacheDatasetPlaceholder.
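
The last caveat follows from what an offline cache can store: anything computed before the placeholder is saved once and reused, so it cannot depend on a sequence_length chosen later at training time. The sketch below illustrates that split in plain Python. It is a conceptual model only, with hypothetical helper names and a toy "tokenizer", not SeqIO's caching implementation.

```python
from typing import Dict, List

Example = Dict[str, List[int]]
expensive_calls = 0  # counts how often the costly offline step runs


def tokenize(text: str) -> List[int]:
    """Toy stand-in for an expensive step: one id per word (its length)."""
    global expensive_calls
    expensive_calls += 1
    return [len(word) for word in text.split()]


def build_cache(raw: List[Dict[str, str]]) -> List[Example]:
    """Offline stage: steps before the placeholder, independent of sequence_length."""
    return [{key: tokenize(text) for key, text in ex.items()} for ex in raw]


def trim(example: Example, sequence_length: Dict[str, int]) -> Example:
    """Online stage: needs sequence_length, so it must run after the cache."""
    return {key: ids[: sequence_length[key]] for key, ids in example.items()}


raw = [{"inputs": "a bb ccc dddd", "targets": "ee f"}]
cache = build_cache(raw)  # computed once

# The same cache serves two different sequence_length settings.
short = [trim(ex, {"inputs": 2, "targets": 1}) for ex in cache]
long = [trim(ex, {"inputs": 4, "targets": 2}) for ex in cache]
print(short, long, expensive_calls)
```

If `trim` ran before the cache were built, the cached data would be tied to one sequence_length and would have to be regenerated for every other setting.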

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 5
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
