NL-Augmenter by GEM-benchmark

Framework for natural language dataset augmentation

Created 4 years ago

786 stars

Top 44.6% on SourcePulse

5 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Rodrigo Nader

Cofounder of Langflow

Yaowei Zheng

Author of LLaMA-Factory

Elvis Saravia

Founder of DAIR.AI

and 1 more!

Project Summary

NL-Augmenter is a collaborative repository providing a framework for augmenting text datasets with diverse natural language transformations. It targets researchers and practitioners in NLP who need to enhance datasets for tasks like style transfer, paraphrasing, and data randomization. The project aims to foster community contributions of novel augmentation techniques.

How It Works

The framework is built around a Python library that allows users to define and apply text transformations. Users can create new transformations by copying existing examples, implementing a generate method within a transformation.py file, and defining test cases in test.json. The project encourages contributions via pull requests, with a focus on novel and creative augmentation methods.
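The workflow above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the SentenceOperation base class below is a local stand-in written for this example (the real library provides its own interfaces), and SwapAdjacentWords is a hypothetical transformation. Only the idea of a generate method that takes text and returns augmented variants comes from the description above.

```python
import random
from typing import List


class SentenceOperation:
    """Stand-in base class (an assumption for this sketch, not the library's)."""

    def __init__(self, seed: int = 0, max_outputs: int = 1) -> None:
        self.seed = seed
        self.max_outputs = max_outputs

    def generate(self, sentence: str) -> List[str]:
        raise NotImplementedError


class SwapAdjacentWords(SentenceOperation):
    """Hypothetical transformation: swap one random pair of adjacent words."""

    def generate(self, sentence: str) -> List[str]:
        # A local, seeded RNG keeps the output reproducible across calls.
        rng = random.Random(self.seed)
        words = sentence.split()
        if len(words) < 2:
            # Nothing to swap; return the input unchanged.
            return [sentence]
        outputs = []
        for _ in range(self.max_outputs):
            i = rng.randrange(len(words) - 1)
            swapped = words.copy()
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            outputs.append(" ".join(swapped))
        return outputs


if __name__ == "__main__":
    op = SwapAdjacentWords(seed=0, max_outputs=2)
    for variant in op.generate("the quick brown fox"):
        print(variant)
```

In a real contribution, a class like this would live in transformation.py, and the expected input/output pairs would be recorded in the accompanying test.json. Seeding the random generator inside generate is one way to keep those recorded outputs stable.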

Quick Start & Requirements

Install:

git clone https://github.com/GEM-benchmark/NL-Augmenter.git
cd NL-Augmenter
python setup.py sdist
pip install -e .
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

Requirements: Python 3.7.
Demo: A Colab notebook is available for quick experimentation.
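After running the install commands, a quick sanity check along these lines may help confirm the environment. It is a sketch under assumptions: check_environment is a hypothetical helper, not part of NL-Augmenter, and it assumes the spaCy model from the last install command is importable as a package named en_core_web_sm. The source states Python 3.7 as the requirement; whether newer interpreters work is not stated, so the version check only reports that the interpreter is at least 3.7.

```python
import importlib.util
import sys


def check_environment(required=("spacy", "en_core_web_sm")):
    """Return a dict mapping each requirement to True (found) or False."""
    status = {"python>=3.7": sys.version_info >= (3, 7)}
    for name in required:
        # find_spec returns None when a top-level package is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'ok' if ok else 'missing'}")
```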

Highlighted Details

Supports a wide range of text augmentation techniques, including randomization, style/syntax changes, and paraphrasing.
Encourages community contributions through a structured pull request process.
Features a code styling standard enforced by black and pre-commit hooks.
Recognizes creative implementations with featured spots on the README and webpage.

Maintenance & Community

The project is a collaborative effort associated with the GEM benchmark initiative. A public Google Groups mailing list is available for contact.

Licensing & Compatibility

The README does not explicitly state the license for the core framework. It notes only that "Some transformations include components released under a different (permissive, open source) license." Refer to the individual transformation directories for specific license details.

Limitations & Caveats

The project requires Python 3.7, which reached end-of-life in June 2023. The README does not make the license of the core framework clear, so license terms must be checked in the individual transformation directories.

Health Check

Last Commit

1 year ago

Responsiveness

1 day

Star History

0 stars in the last 30 days