Dataset and code for generating adversarial hate speech
ToxiGen provides tools and a large-scale dataset for detecting implicit and adversarial hate speech, particularly targeting minority groups. It's designed for researchers and developers aiming to build more robust content moderation systems that can identify subtle toxicity without relying on slurs or profanity.
How It Works
The project offers two primary methods for generating toxic and benign sentences: Demonstration-Based Prompting, which uses human-provided prompts with large language models (LLMs) like GPT-3, and ALICE, an adversarial approach that pits a generator LLM against a toxicity classifier to create challenging examples. This adversarial setup aims to iteratively improve classifier performance on subtle hate speech.
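ALICE in particular lends itself to a small illustration. The sketch below approximates the adversarial loop with a simple best-of-n filter: sample several continuations from a generator, score each with a toxicity classifier, and keep the one the classifier is least confident about. Note that the real ALICE steers decoding token by token with class-conditional beam search, and the gpt2 and unitary/toxic-bert checkpoints here are illustrative stand-ins, not part of the ToxiGen codebase.

```python
# Simplified sketch of the adversarial idea behind ALICE: sample several
# continuations, then keep the one the toxicity classifier is least sure
# about. ALICE proper uses class-conditional beam search during decoding;
# this best-of-n loop only approximates that. The model checkpoints below
# are assumptions for illustration, not ToxiGen's own models.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def most_adversarial(prompt: str, n: int = 8) -> str:
    """Return the sampled continuation whose toxicity score sits closest
    to the classifier's decision boundary (score around 0.5)."""
    candidates = generator(
        prompt,
        num_return_sequences=n,
        do_sample=True,
        max_new_tokens=40,
        pad_token_id=50256,  # silence GPT-2's missing-pad-token warning
    )
    texts = [c["generated_text"] for c in candidates]
    scores = [classifier(t, truncation=True)[0]["score"] for t in texts]
    # The hardest example is the one the classifier is least confident about.
    return min(zip(texts, scores), key=lambda ts: abs(ts[1] - 0.5))[0]

print(most_adversarial("Immigrants in this country"))
```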
Quick Start & Requirements
Install with pip install toxigen. The ToxiGen dataset is gated on the Hugging Face Hub, so authenticate first (e.g., via huggingface-cli login) and pass use_auth_token=True when calling load_dataset.
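A minimal loading sketch follows; the repo id skg/toxigen-data and the name="train" config follow the upstream project's README but are worth verifying against the current Hub listing.

```python
# Load the ToxiGen data from the Hugging Face Hub. The dataset is gated,
# so run `huggingface-cli login` first; use_auth_token=True then forwards
# your stored token. Repo id and config name are taken from the upstream
# README and may need checking against the current Hub listing.
from datasets import load_dataset

tg = load_dataset("skg/toxigen-data", name="train", use_auth_token=True)
print(tg)
```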
Highlighted Details
Maintenance & Community
The last commit was about a year ago, and the repository is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The dataset focuses on implicit toxicity for 13 specific minority groups and may not capture the full complexity or context-dependent nature of problematic language. The authors note that the dataset can be noisy due to its scale and that further research is needed to include more target groups and scenarios.