text by pytorch

PyTorch library for NLP tasks

Created 9 years ago

3,563 stars

Top 13.5% on SourcePulse

16 Experts Love This Project

Luis Capelo

Cofounder of Lightning AI

Eugene Yan

AI Scientist at AWS

Artidoro Pagnoni

Coauthor of QLoRA; Research Scientist at Meta

Jeff Hammerbacher

Cofounder of Cloudera

and 12 more!

Project Summary

TorchText provides PyTorch-native tools for natural language processing: datasets, data processing utilities, pre-trained models, and tokenizers. It aims to simplify NLP workflows for researchers and developers building with PyTorch. Development ended with the 0.18 release.

How It Works

TorchText integrates with torchdata for efficient dataset loading and provides a modular architecture for text processing. It includes components for vocabulary management, text transformations (like tokenization and normalization), and pre-trained model integration, enabling streamlined NLP pipeline construction.
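To make the pipeline idea concrete, here is a minimal plain-Python sketch of the pattern described above: tokenize text, build a vocabulary, then compose the two into one transform. It deliberately does not use TorchText's API. The names basic_tokenize, SimpleVocab, and compose are hypothetical and exist only for this illustration, and the whitespace tokenizer is a stand-in for a real one.

```python
from collections import Counter
from typing import Callable, Iterable, List, Sequence


def basic_tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (stand-in for a real tokenizer)."""
    return text.lower().split()


class SimpleVocab:
    """Hypothetical token-to-index mapping with an unknown-token fallback."""

    def __init__(
        self,
        token_lists: Iterable[Sequence[str]],
        specials: Sequence[str] = ("<unk>",),
        min_freq: int = 1,
    ) -> None:
        counts = Counter(tok for toks in token_lists for tok in toks)
        self.itos: List[str] = list(specials)
        # Order by descending frequency, then alphabetically, so indices are deterministic.
        for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if freq >= min_freq and tok not in self.itos:
                self.itos.append(tok)
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}
        self.unk_index = self.stoi[specials[0]]

    def __call__(self, tokens: Sequence[str]) -> List[int]:
        # Tokens missing from the vocabulary map to the unknown index.
        return [self.stoi.get(tok, self.unk_index) for tok in tokens]

    def __len__(self) -> int:
        return len(self.itos)


def compose(*steps: Callable) -> Callable:
    """Chain transforms left to right into a single callable."""

    def pipeline(value):
        for step in steps:
            value = step(value)
        return value

    return pipeline


corpus = ["the cat sat", "the dog sat down"]
vocab = SimpleVocab(basic_tokenize(line) for line in corpus)
text_to_ids = compose(basic_tokenize, vocab)
ids = text_to_ids("The cat barked")  # "barked" is unseen, so it maps to <unk>
print(ids)
```

Keeping each step a plain callable is what makes the pipeline modular: a different tokenizer or an extra normalization step can be swapped in without touching the vocabulary.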

Quick Start & Requirements

Install with pip (pip install torchtext) or conda (conda install -c pytorch torchtext).
Requires PyTorch; see the version compatibility table in the README.
Optional: to use the spaCy tokenizer, run pip install spacy, then python -m spacy download en_core_web_sm.
Documentation: https://pytorch.org/text/
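Because each TorchText release targets a specific PyTorch release, a quick check of the installed pair can catch mismatches early. The sketch below is illustrative only: the KNOWN_PAIRS table is an assumption recalled from the README's compatibility table and should be verified there, and the helper names (major_minor, installed_version, check_pair) are hypothetical.

```python
from importlib import metadata
from typing import Optional

# ASSUMPTION: torchtext -> PyTorch pairings recalled from the README table.
# Verify against the README before relying on them.
KNOWN_PAIRS = {"0.18": "2.3", "0.17": "2.2", "0.16": "2.1", "0.15": "2.0"}


def major_minor(version: str) -> str:
    """Reduce a version such as '2.3.1+cu121' to '2.3'."""
    return ".".join(version.split("+")[0].split(".")[:2])


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pair(torchtext_version: str, torch_version: str) -> str:
    """Return 'ok', 'mismatch', or 'unknown' for a torchtext/torch pair."""
    expected = KNOWN_PAIRS.get(major_minor(torchtext_version))
    if expected is None:
        return "unknown"  # release not covered by the assumed table
    return "ok" if major_minor(torch_version) == expected else "mismatch"


if __name__ == "__main__":
    text_v = installed_version("torchtext")
    torch_v = installed_version("torch")
    if text_v is None or torch_v is None:
        print("torchtext and/or torch is not installed")
    else:
        print(f"torchtext {text_v} with torch {torch_v}: {check_pair(text_v, torch_v)}")
```

Reading versions through importlib.metadata avoids importing either package, so the check still works when a mismatched pair would fail at import time.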

Highlighted Details

Supports numerous NLP datasets (e.g., WikiText, Multi30k, SQuAD).
Integrates pre-trained models like RoBERTa, XLM-RoBERTa, and T5 variants.
Offers several tokenizers: SentencePiece, GPT-2 BPE, CLIP, regular-expression (RE2-backed), and BERT.
Includes tutorials for common NLP tasks like text classification and translation.

Maintenance & Community

TorchText development has stopped, with the 0.18 release (April 2024) being the last stable version.

Licensing & Compatibility

TorchText is released under a BSD-3-Clause license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Development has been discontinued, so no further updates or bug fixes are expected. Expect possible incompatibilities with PyTorch versions released after 0.18, and do not expect support for newer NLP techniques.

Health Check

Last Commit

4 months ago

Responsiveness

Inactive

Star History

5 stars in the last 30 days