SeqIO: Task-based datasets, preprocessing, and evaluation library for sequence models
SeqIO is a library for processing sequential data for downstream sequence models, targeting researchers and engineers working with large-scale NLP and sequence modeling tasks. It provides a flexible framework for defining datasets, applying preprocessing steps, tokenizing data, and evaluating model performance, aiming to simplify and standardize data pipelines.
How It Works
SeqIO abstracts data processing into Task objects, which encapsulate data sources (e.g., TFDS, text files), preprocessing functions, tokenization vocabularies, and evaluation metrics. It builds its data pipelines on tf.data.Dataset while keeping the TensorFlow dependency minimal, so the output can be fed to other frameworks such as JAX and PyTorch via NumPy iterators. Users define Mixture objects to combine multiple tasks with specified sampling rates.
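The Task and Mixture ideas above can be illustrated with a small plain-Python sketch. It does not call SeqIO: `ToyTask`, `mixing_probabilities`, and `sample_mixture` are hypothetical names invented for this illustration, and real SeqIO tasks operate on tf.data.Dataset objects rather than Python lists. The sketch assumes sampling rates are normalized into probabilities before tasks are drawn.

```python
import itertools
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Example = Dict[str, str]


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task: a data source plus ordered preprocessors."""
    name: str
    source: Callable[[], Iterable[Example]]
    preprocessors: List[Callable[[Example], Example]] = field(default_factory=list)

    def get_dataset(self) -> Iterator[Example]:
        # Apply every preprocessor, in order, to each example from the source.
        for example in self.source():
            for fn in self.preprocessors:
                example = fn(example)
            yield example


def mixing_probabilities(rates: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw sampling rates so that they sum to 1."""
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}


def sample_mixture(
    tasks_and_rates: List[Tuple[ToyTask, float]],
    num_examples: int,
    seed: int = 0,
) -> List[Tuple[str, Example]]:
    """Draw examples from several tasks in proportion to their rates."""
    rng = random.Random(seed)
    tasks = [task for task, _ in tasks_and_rates]
    weights = [rate for _, rate in tasks_and_rates]
    # Cycle each task's examples so the short toy sources never run out.
    iterators = {t.name: itertools.cycle(list(t.get_dataset())) for t in tasks}
    mixed = []
    for _ in range(num_examples):
        task = rng.choices(tasks, weights=weights, k=1)[0]
        mixed.append((task.name, next(iterators[task.name])))
    return mixed


def lowercase(example: Example) -> Example:
    return {key: value.lower() for key, value in example.items()}


translate = ToyTask(
    "translate", lambda: [{"inputs": "Hello", "targets": "Hallo"}], [lowercase]
)
summarize = ToyTask(
    "summarize", lambda: [{"inputs": "Long TEXT", "targets": "Short"}], [lowercase]
)
mixed = sample_mixture([(translate, 1.0), (summarize, 3.0)], num_examples=8)
```

With rates of 1.0 and 3.0, the second task is drawn with probability 0.75 on each step; over a small sample the observed proportions will vary.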
Quick Start & Requirements
Install: pip install seqio
Requires TensorFlow for tf.data.Dataset operations.
Highlighted Details
FeatureConverter classes.
Evaluator class for metrics like BLEU and exact match.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Relies on tf.data.Dataset.
Preprocessing steps that use sequence_length must occur after the CacheDatasetPlaceholder.
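The ordering constraint can be sketched as a simple check over a list of preprocessors. This is an illustration in plain Python, not SeqIO's own validation code: `CachePlaceholder` and `check_preprocessor_order` are hypothetical names, and the sketch assumes that a step "uses" sequence_length when it declares a parameter with that name.

```python
import inspect
from typing import Optional, Sequence


class CachePlaceholder:
    """Hypothetical marker standing in for SeqIO's CacheDatasetPlaceholder."""


def check_preprocessor_order(preprocessors: Sequence[object]) -> None:
    """Raise ValueError if a step taking `sequence_length` precedes the placeholder."""
    placeholder_idx: Optional[int] = next(
        (i for i, p in enumerate(preprocessors) if isinstance(p, CachePlaceholder)),
        None,
    )
    if placeholder_idx is None:
        return  # Nothing is cached, so no ordering constraint applies.
    for step in preprocessors[:placeholder_idx]:
        if "sequence_length" in inspect.signature(step).parameters:
            raise ValueError(
                f"{step.__name__} uses sequence_length but runs before the "
                "cache placeholder"
            )


def tokenize(example):
    # Length-independent step: safe to run before caching.
    return example


def trim(example, sequence_length):
    # Length-dependent step: must run after the placeholder.
    return {key: value[: sequence_length[key]] for key, value in example.items()}


# Valid order: the length-dependent step comes after the placeholder.
check_preprocessor_order([tokenize, CachePlaceholder(), trim])
```

The rationale is that cached output is meant to be reusable across runs, while sequence lengths are typically chosen at run time, so length-dependent steps have to stay outside the cached portion.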