google/SeqIO: Task-based datasets, preprocessing, and evaluation library for sequence models
Top 55.2% on SourcePulse
SeqIO is a library for processing sequential data for downstream sequence models, targeting researchers and engineers working with large-scale NLP and sequence modeling tasks. It provides a flexible framework for defining datasets, applying preprocessing steps, tokenizing data, and evaluating model performance, aiming to simplify and standardize data pipelines.
How It Works
SeqIO abstracts data processing into Task objects, which encapsulate data sources (e.g., TFDS, text files), preprocessing functions, tokenization vocabularies, and evaluation metrics. It leverages tf.data.Dataset for efficient data pipelines, with minimal TensorFlow dependency, allowing seamless integration with other frameworks like JAX and PyTorch via NumPy iterators. Users define Mixture objects to combine multiple tasks with specified sampling rates.
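To make the idea of combining tasks at specified sampling rates concrete, here is a minimal plain-Python sketch. It does not use SeqIO and is not how the library implements Mixture objects; the function `sample_mixture`, the task names, and the choice to cycle through exhausted tasks are all assumptions made for illustration.

```python
import itertools
import random
from typing import Dict, Iterator, Sequence, Tuple


def sample_mixture(
    tasks: Dict[str, Sequence[str]],
    rates: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (task_name, example) pairs, choosing a task per step in proportion to its rate."""
    rng = random.Random(seed)
    names = list(tasks)
    weights = [rates[name] for name in names]
    # Assumption of this sketch: each task repeats indefinitely so that
    # small tasks can keep contributing examples.
    iterators = {name: itertools.cycle(tasks[name]) for name in names}
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, next(iterators[name])


# Hypothetical tasks with toy examples; rates of 3:1 favor "translate".
tasks = {"translate": ["t0", "t1", "t2"], "summarize": ["s0", "s1"]}
rates = {"translate": 3.0, "summarize": 1.0}

batch = list(itertools.islice(sample_mixture(tasks, rates, seed=0), 1000))
counts = {name: sum(1 for n, _ in batch if n == name) for name in tasks}
print(counts)
```

With rates of 3.0 and 1.0, roughly three quarters of the sampled examples should come from the first task. Only the ratio between rates matters in this sketch, not their absolute values.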
Quick Start & Requirements
pip install seqio
... tf.data.Dataset operations.
Highlighted Details
... FeatureConverter classes.
Evaluator class for metrics like BLEU and exact match.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
... tf.data.Dataset.
... sequence_length must occur after the CacheDatasetPlaceholder.