universal-triggers by Eric-Wallace

NLP attack/analysis research paper (EMNLP 2019)

created 6 years ago
295 stars

Top 90.7% on sourcepulse

Project Summary

This repository provides the official code for the EMNLP 2019 paper "Universal Adversarial Triggers for Attacking and Analyzing NLP." It enables researchers and practitioners to generate universal adversarial triggers for various NLP tasks, aiding in model analysis and security assessment. The primary benefit is a standardized method for probing model vulnerabilities and understanding their decision boundaries.

How It Works

The project implements a gradient-guided search for short token sequences (triggers) that, when concatenated to any input, cause a target NLP model to produce a specific wrong prediction or otherwise misbehave. Starting from an initial trigger (e.g., repeated placeholder words), the search iteratively swaps trigger tokens: it computes the gradient of the task loss with respect to the current trigger token embeddings, then uses a first-order (HotFlip-style) approximation of that loss to rank replacement tokens that push the model toward the adversarial target. Because the gradient is averaged over batches of examples, the resulting trigger is input-agnostic: a single short sequence transfers across many inputs, unlike traditional per-example adversarial examples.
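
To make the token-swap step concrete, here is a minimal PyTorch sketch of that first-order candidate selection. The function name and signature are illustrative assumptions, not this repository's exact API (the repo's own implementation lives in attacks.py):

    import torch

    def pick_replacement_tokens(avg_grad, embedding_matrix,
                                num_candidates=1, increase_loss=False):
        # avg_grad: [trigger_len, embed_dim] gradient of the loss w.r.t. the
        # current trigger embeddings, averaged over a batch of examples.
        # embedding_matrix: [vocab_size, embed_dim] model word embeddings.
        # First-order approximation: swapping position i's embedding e_i for
        # token w's embedding e_w changes the loss by roughly (e_w - e_i)^T g_i,
        # so ranking tokens by e_w^T g_i suffices (the e_i term is constant).
        scores = torch.einsum("td,vd->tv", avg_grad, embedding_matrix)
        if not increase_loss:
            scores = -scores  # prefer tokens that *decrease* the target loss
        return scores.topk(num_candidates, dim=1).indices  # [trigger_len, k]

In the full method, each candidate swap is then re-scored on the batch and only the best one is kept, so the cheap linear approximation never has to be trusted blindly.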

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n triggers python=3.6), activate it (source activate triggers), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: PyTorch, HuggingFace Transformers (for GPT-2), and AllenNLP (for SQuAD, SNLI, SST). GPU is recommended for larger models; experiments on SST and SNLI can run without one.
  • Getting Started: The README recommends starting with the snli or sst attack examples, which are well documented and illustrate the methodology (a minimal evaluation sketch follows this list).
  • References: Paper, Blog
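
Once an attack script finds a trigger, evaluating it amounts to prepending the trigger tokens to every input and measuring how far the targeted metric drops. A minimal sketch, assuming a generic PyTorch classifier (model, batch_token_ids, and labels are placeholders, not objects from this repo):

    import torch

    def accuracy_with_trigger(model, batch_token_ids, labels, trigger_ids):
        # Tile the trigger across the batch and prepend it to every input.
        trigger = torch.tensor(trigger_ids).unsqueeze(0)
        trigger = trigger.expand(batch_token_ids.size(0), -1)
        triggered = torch.cat([trigger, batch_token_ids], dim=1)
        preds = model(triggered).argmax(dim=-1)
        return (preds == labels).float().mean().item()

A successful trigger drives this number well below the model's clean accuracy.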

Highlighted Details

  • Code supports attacks on sentiment analysis (SST), natural language inference (SNLI), reading comprehension (SQuAD), and the GPT-2 language model.
  • Gradient-based attack implementation is available in attacks.py.
  • Utility functions for model evaluation and gradient computation are in utils.py (for AllenNLP models).
  • Designed for flexibility and extensibility to other models and tasks, especially those within AllenNLP; a toy end-to-end sketch of the search loop follows this list.
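
As a sense of what extending the attack to a new model involves, the following self-contained toy loop runs the whole search against a tiny mean-pooled bag-of-embeddings classifier. It is illustrative only and does not call this repository's API; the real code adds candidate re-scoring, beam search, and AllenNLP plumbing:

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, num_classes, trigger_len = 1000, 32, 2, 3
    embedding = nn.Embedding(vocab_size, embed_dim)
    classifier = nn.Linear(embed_dim, num_classes)

    # A batch of random "inputs" the trigger should push toward class 0.
    inputs = torch.randint(0, vocab_size, (16, 10))
    target = torch.zeros(16, dtype=torch.long)
    loss_fn = nn.CrossEntropyLoss()

    trigger = torch.zeros(trigger_len, dtype=torch.long)  # initial trigger
    for _ in range(10):
        # Gradient of the target loss w.r.t. the trigger embeddings only.
        trig_embeds = embedding(trigger).detach().requires_grad_(True)
        batch_embeds = torch.cat(
            [trig_embeds.unsqueeze(0).expand(16, -1, -1),
             embedding(inputs).detach()], dim=1)
        logits = classifier(batch_embeds.mean(dim=1))  # mean-pooled bag model
        loss = loss_fn(logits, target)
        loss.backward()
        # HotFlip step: per position, take the vocab token whose embedding
        # has the most negative dot product with the gradient.
        with torch.no_grad():
            scores = torch.einsum("td,vd->tv",
                                  trig_embeds.grad, embedding.weight)
            trigger = scores.argmin(dim=1)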

Maintenance & Community

  • Developed by Eric Wallace; contributions via pull requests are welcome. Issues can be reported via the GitHub issues tracker.
  • Contact: ericwallace@berkeley.edu

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The code targets older versions of PyTorch, HuggingFace Transformers, and AllenNLP, so running it in current environments may require pinned dependencies or light porting. The focus is research and analysis rather than production-ready deployment.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days

Explore Similar Projects

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

  • llm-attacks by llm-attacks: Attack framework for aligned LLMs, based on a research paper. Top 0.4% on sourcepulse, 4k stars, created 2 years ago, updated 1 year ago.