Dataset and code for generating adversarial hate speech
ToxiGen provides tools and a large-scale dataset for detecting implicit and adversarial hate speech, particularly targeting minority groups. It's designed for researchers and developers aiming to build more robust content moderation systems that can identify subtle toxicity without relying on slurs or profanity.
How It Works
The project offers two primary methods for generating toxic and benign sentences: Demonstration-Based Prompting, which uses human-provided prompts with large language models (LLMs) like GPT-3, and ALICE, an adversarial approach that pits a generator LLM against a toxicity classifier to create challenging examples. This adversarial setup aims to iteratively improve classifier performance on subtle hate speech.
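ALICE in particular lends itself to a small illustration. The sketch below approximates the adversarial loop with a simple best-of-n filter: sample several continuations from a generator, score each with a toxicity classifier, and keep the one the classifier is least confident about. Note that the real ALICE steers decoding token by token with class-conditional beam search, and the gpt2 and unitary/toxic-bert checkpoints here are illustrative stand-ins, not part of the ToxiGen codebase.

```python
# Simplified sketch of the adversarial idea behind ALICE: sample several
# continuations, then keep the one the toxicity classifier is least sure
# about. ALICE proper uses class-conditional beam search during decoding;
# this best-of-n loop only approximates that. The model checkpoints below
# are assumptions for illustration, not ToxiGen's own models.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def most_adversarial(prompt: str, n: int = 8) -> str:
    """Return the sampled continuation whose toxicity score sits closest
    to the classifier's decision boundary (score around 0.5)."""
    candidates = generator(
        prompt,
        num_return_sequences=n,
        do_sample=True,
        max_new_tokens=40,
        pad_token_id=50256,  # silence GPT-2's missing-pad-token warning
    )
    texts = [c["generated_text"] for c in candidates]
    scores = [classifier(t, truncation=True)[0]["score"] for t in texts]
    # The hardest example is the one the classifier is least confident about.
    return min(zip(texts, scores), key=lambda ts: abs(ts[1] - 0.5))[0]

print(most_adversarial("Immigrants in this country"))
```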
Quick Start & Requirements
Install with pip install toxigen. The ToxiGen dataset is gated on the Hugging Face Hub, so authenticate first (e.g., via huggingface-cli login) and pass use_auth_token=True when calling load_dataset.
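A minimal loading sketch follows; the repo id skg/toxigen-data and the name="train" config follow the upstream project's README but are worth verifying against the current Hub listing.

```python
# Load the ToxiGen data from the Hugging Face Hub. The dataset is gated,
# so run `huggingface-cli login` first; use_auth_token=True then forwards
# your stored token. Repo id and config name are taken from the upstream
# README and may need checking against the current Hub listing.
from datasets import load_dataset

tg = load_dataset("skg/toxigen-data", name="train", use_auth_token=True)
print(tg)
```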
Highlighted Details
Maintenance & Community
The last commit was about a year ago, and the repository is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The dataset focuses on implicit toxicity for 13 specific minority groups and may not capture the full complexity or context-dependent nature of problematic language. The authors note that the dataset can be noisy due to its scale and that further research is needed to include more target groups and scenarios.