universal-triggers by Eric-Wallace

NLP attack/analysis research paper (EMNLP 2019)

created 6 years ago
295 stars

Top 90.7% on sourcepulse

Project Summary

This repository provides the official code for the EMNLP 2019 paper "Universal Adversarial Triggers for Attacking and Analyzing NLP." It enables researchers and practitioners to generate universal adversarial triggers for various NLP tasks, aiding in model analysis and security assessment. The primary benefit is a standardized method for probing model vulnerabilities and understanding their decision boundaries.

How It Works

The project implements a gradient-guided search for short token sequences (triggers) that, when concatenated to any input, cause a target NLP model to produce a specific wrong prediction or otherwise misbehave. Starting from an initial trigger (e.g., repeated placeholder words), the search iteratively swaps trigger tokens: it computes the gradient of the task loss with respect to the current trigger token embeddings, then uses a first-order (HotFlip-style) approximation of that loss to rank replacement tokens that push the model toward the adversarial target. Because the gradient is averaged over batches of examples, the resulting trigger is input-agnostic: a single short sequence transfers across many inputs, unlike traditional per-example adversarial examples.
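
To make the token-swap step concrete, here is a minimal PyTorch sketch of that first-order candidate selection. The function name and signature are illustrative assumptions, not this repository's exact API (the repo's own implementation lives in attacks.py):

    import torch

    def pick_replacement_tokens(avg_grad, embedding_matrix,
                                num_candidates=1, increase_loss=False):
        # avg_grad: [trigger_len, embed_dim] gradient of the loss w.r.t. the
        # current trigger embeddings, averaged over a batch of examples.
        # embedding_matrix: [vocab_size, embed_dim] model word embeddings.
        # First-order approximation: swapping position i's embedding e_i for
        # token w's embedding e_w changes the loss by roughly (e_w - e_i)^T g_i,
        # so ranking tokens by e_w^T g_i suffices (the e_i term is constant).
        scores = torch.einsum("td,vd->tv", avg_grad, embedding_matrix)
        if not increase_loss:
            scores = -scores  # prefer tokens that *decrease* the target loss
        return scores.topk(num_candidates, dim=1).indices  # [trigger_len, k]

In the full method, each candidate swap is then re-scored on the batch and only the best one is kept, so the cheap linear approximation never has to be trusted blindly.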

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n triggers python=3.6), activate it (source activate triggers), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: PyTorch, HuggingFace Transformers (for GPT-2), and AllenNLP (for SQuAD, SNLI, SST). GPU is recommended for larger models; experiments on SST and SNLI can run without one.
  • Getting Started: The README recommends starting with the snli or sst attack examples, which are well documented and illustrate the methodology (a minimal evaluation sketch follows this list).
  • References: Paper, Blog
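
Once an attack script finds a trigger, evaluating it amounts to prepending the trigger tokens to every input and measuring how far the targeted metric drops. A minimal sketch, assuming a generic PyTorch classifier (model, batch_token_ids, and labels are placeholders, not objects from this repo):

    import torch

    def accuracy_with_trigger(model, batch_token_ids, labels, trigger_ids):
        # Tile the trigger across the batch and prepend it to every input.
        trigger = torch.tensor(trigger_ids).unsqueeze(0)
        trigger = trigger.expand(batch_token_ids.size(0), -1)
        triggered = torch.cat([trigger, batch_token_ids], dim=1)
        preds = model(triggered).argmax(dim=-1)
        return (preds == labels).float().mean().item()

A successful trigger drives this number well below the model's clean accuracy.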

Highlighted Details

  • Code supports attacks on sentiment analysis (SST), natural language inference (SNLI), reading comprehension (SQuAD), and the GPT-2 language model.
  • Gradient-based attack implementation is available in attacks.py.
  • Utility functions for model evaluation and gradient computation are in utils.py (for AllenNLP models).
  • Designed for flexibility and extensibility to other models and tasks, especially those within AllenNLP; a toy end-to-end sketch of the search loop follows this list.
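
As a sense of what extending the attack to a new model involves, the following self-contained toy loop runs the whole search against a tiny mean-pooled bag-of-embeddings classifier. It is illustrative only and does not call this repository's API; the real code adds candidate re-scoring, beam search, and AllenNLP plumbing:

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, num_classes, trigger_len = 1000, 32, 2, 3
    embedding = nn.Embedding(vocab_size, embed_dim)
    classifier = nn.Linear(embed_dim, num_classes)

    # A batch of random "inputs" the trigger should push toward class 0.
    inputs = torch.randint(0, vocab_size, (16, 10))
    target = torch.zeros(16, dtype=torch.long)
    loss_fn = nn.CrossEntropyLoss()

    trigger = torch.zeros(trigger_len, dtype=torch.long)  # initial trigger
    for _ in range(10):
        # Gradient of the target loss w.r.t. the trigger embeddings only.
        trig_embeds = embedding(trigger).detach().requires_grad_(True)
        batch_embeds = torch.cat(
            [trig_embeds.unsqueeze(0).expand(16, -1, -1),
             embedding(inputs).detach()], dim=1)
        logits = classifier(batch_embeds.mean(dim=1))  # mean-pooled bag model
        loss = loss_fn(logits, target)
        loss.backward()
        # HotFlip step: per position, take the vocab token whose embedding
        # has the most negative dot product with the gradient.
        with torch.no_grad():
            scores = torch.einsum("td,vd->tv",
                                  trig_embeds.grad, embedding.weight)
            trigger = scores.argmin(dim=1)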

Maintenance & Community

  • Developed by Eric Wallace; contributions via pull requests are welcome. Issues can be reported via the GitHub issues tracker.
  • Contact: ericwallace@berkeley.edu

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The code targets older versions of PyTorch, HuggingFace Transformers, and AllenNLP, so running it in current environments may require pinned dependencies or light porting. The focus is research and analysis rather than production-ready deployment.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

  • llm-attacks by llm-attacks: Attack framework for aligned LLMs, based on a research paper. Top 0.4% on sourcepulse, 4k stars, created 2 years ago, updated 1 year ago.