NL-Augmenter by GEM-benchmark

Framework for natural language dataset augmentation

Created 4 years ago
788 stars

Top 44.5% on SourcePulse

Project Summary

NL-Augmenter is a collaborative repository providing a framework for augmenting text datasets with diverse natural language transformations. It targets researchers and practitioners in NLP who need to enhance datasets for tasks like style transfer, paraphrasing, and data randomization. The project aims to foster community contributions of novel augmentation techniques.

How It Works

The framework is built around a Python library that allows users to define and apply text transformations. Users can create new transformations by copying existing examples, implementing a generate method within a transformation.py file, and defining test cases in test.json. The project encourages contributions via pull requests, with a focus on novel and creative augmentation methods.
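The workflow described above can be sketched as follows. This is a minimal, self-contained illustration, not the framework's actual code: the `SentenceOperation` class here is a hypothetical stand-in for the library's base class, `SwapAdjacentWords` is an invented example transformation, and the layout of the `test.json` content is an assumption. Consult the existing examples in the repository for the real interfaces and file format.

```python
import json
import random


class SentenceOperation:
    """Hypothetical stand-in for the framework's base class (assumption)."""

    def __init__(self, seed=0, max_outputs=1):
        self.seed = seed
        self.max_outputs = max_outputs

    def generate(self, sentence):
        raise NotImplementedError


class SwapAdjacentWords(SentenceOperation):
    """Invented example: swap one randomly chosen pair of adjacent words."""

    def generate(self, sentence):
        words = sentence.split()
        if len(words) < 2:
            return [sentence]  # nothing to swap
        rng = random.Random(self.seed)  # seeded so test cases are reproducible
        outputs = []
        for _ in range(self.max_outputs):
            i = rng.randrange(len(words) - 1)
            swapped = words[:]
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            outputs.append(" ".join(swapped))
        return outputs


# Assumed shape of a test.json file: inputs paired with expected outputs.
TEST_JSON = """
{
  "type": "swap_adjacent_words",
  "test_cases": [
    {
      "class": "SwapAdjacentWords",
      "inputs": {"sentence": "hello world"},
      "outputs": [{"sentence": "world hello"}]
    }
  ]
}
"""


def run_test_cases(operation, spec):
    """Return True if generate() reproduces every expected output list."""
    for case in spec["test_cases"]:
        got = operation.generate(case["inputs"]["sentence"])
        expected = [o["sentence"] for o in case["outputs"]]
        if got != expected:
            return False
    return True


spec = json.loads(TEST_JSON)
all_passed = run_test_cases(SwapAdjacentWords(seed=0, max_outputs=1), spec)
```

Seeding the random generator inside `generate` is a design choice made here so that the expected outputs recorded in the test cases stay stable across runs.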

Quick Start & Requirements

  • Install:
    git clone https://github.com/GEM-benchmark/NL-Augmenter.git
    cd NL-Augmenter
    python setup.py sdist
    pip install -e .
    pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
    
  • Requirements: Python 3.7.
  • Demo: A Colab notebook is available for quick experimentation.

Highlighted Details

  • Supports a wide range of text augmentation techniques, including randomization, style/syntax changes, and paraphrasing.
  • Encourages community contributions through a structured pull request process.
  • Features a code styling standard enforced by black and pre-commit hooks.
  • Recognizes creative implementations with featured spots on the README and webpage.

Maintenance & Community

The project is a collaborative effort associated with the GEM benchmark initiative. Maintainers can be contacted through a public Google Groups email address.

Licensing & Compatibility

The README does not explicitly state the primary license, but it notes that "Some transformations include components released under a different (permissive, open source) license." Users are advised to refer to individual transformation directories for specific license details.

Limitations & Caveats

The project requires Python 3.7, which reached end-of-life in June 2023. Because the README does not make the core framework's license clear, check licensing before reusing the code.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
