Framework for natural language dataset augmentation
Top 45.4% on sourcepulse
NL-Augmenter is a collaborative repository providing a framework for augmenting text datasets with diverse natural language transformations. It targets researchers and practitioners in NLP who need to enhance datasets for tasks like style transfer, paraphrasing, and data randomization. The project aims to foster community contributions of novel augmentation techniques.
How It Works
The framework is built around a Python library that allows users to define and apply text transformations. Users can create new transformations by copying existing examples, implementing a generate method within a transformation.py file, and defining test cases in test.json. The project encourages contributions via pull requests, with a focus on novel and creative augmentation methods.
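The workflow above can be sketched as follows. This is a minimal, self-contained illustration rather than the repository's actual code: the SentenceOperation class here is a hypothetical stand-in for the framework's base class, the seed and max_outputs parameters are assumptions, and the test-case layout is invented for the example (the real test.json schema is defined by the repository).

```python
import random
from typing import List


class SentenceOperation:
    """Hypothetical stand-in for the framework's base class (an assumption)."""

    def __init__(self, seed: int = 0, max_outputs: int = 1):
        self.seed = seed
        self.max_outputs = max_outputs

    def generate(self, sentence: str) -> List[str]:
        raise NotImplementedError


class SwapAdjacentWords(SentenceOperation):
    """Illustrative transformation: swap one random pair of adjacent words."""

    def generate(self, sentence: str) -> List[str]:
        words = sentence.split()
        if len(words) < 2:
            # Nothing to swap; return the input unchanged.
            return [sentence]
        rng = random.Random(self.seed)  # seeded so outputs are reproducible
        outputs = []
        for _ in range(self.max_outputs):
            swapped = words.copy()
            i = rng.randrange(len(words) - 1)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            outputs.append(" ".join(swapped))
        return outputs


# Hypothetical test-case layout, standing in for entries in test.json.
test_cases = [{"input": "a b", "expected": ["b a"]}]
for case in test_cases:
    assert SwapAdjacentWords(seed=0).generate(case["input"]) == case["expected"]
```

Seeding the random generator inside generate keeps each call reproducible, which is what makes fixed expected outputs in a test file practical.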
Quick Start & Requirements
git clone https://github.com/GEM-benchmark/NL-Augmenter.git
cd NL-Augmenter
python setup.py sdist
pip install -e .
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
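After running the commands above, a quick sanity check can confirm that the spaCy dependencies are visible to the interpreter. The helper below is a hypothetical sketch, not part of NL-Augmenter: check_environment is an invented name, and it only looks up module names using the standard library, without importing or loading anything.

```python
import importlib.util
import sys
from typing import Dict, Iterable


def check_environment(
    modules: Iterable[str] = ("spacy", "en_core_web_sm"),
) -> Dict[str, bool]:
    """Report whether each top-level module can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Print the running interpreter version alongside the module lookup results.
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" result for en_core_web_sm suggests the final pip install step did not complete in the active environment.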
Highlighted Details
black and pre-commit hooks.
Maintenance & Community
The project is a collaborative effort with a public Google Groups email for contact. It is associated with the GEM benchmark initiative.
Licensing & Compatibility
The README does not explicitly state the primary license, but it notes that "Some transformations include components released under a different (permissive, open source) license." Users should check individual transformation directories for specific license details.
Limitations & Caveats
The project requires Python 3.7, which is now end-of-life. The license of the core framework is not clear from the README, so the individual transformation directories need to be reviewed.
1 year ago
Inactive