ToxiGen by microsoft

Dataset and code for generating adversarial hate speech

created 3 years ago
325 stars

Top 85.0% on sourcepulse

Project Summary

ToxiGen provides tools and a large-scale dataset for detecting implicit and adversarial hate speech, particularly speech targeting minority groups. It is designed for researchers and developers building more robust content moderation systems that can identify subtle toxicity even when it contains no slurs or profanity.

How It Works

The project offers two ways to generate toxic and benign sentences: Demonstration-Based Prompting, which seeds a large language model (LLM) such as GPT-3 with human-curated example prompts, and ALICE, an adversarial decoding scheme that pits the generator LLM against a pre-trained toxicity classifier to produce sentences that challenge the classifier. Training on these hard examples is meant to iteratively improve classifier performance on subtle hate speech; a simplified sketch of the adversarial loop follows.
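
The toy sketch below captures the adversarial idea only in miniature: ALICE proper steers constrained beam search token by token, while this version merely re-ranks sampled continuations. The helpers sample_continuations, lm_logprob, and toxicity are hypothetical stand-ins for a GPT-3-style generator and a HateBERT/RoBERTa-style classifier, not part of the toxigen package.

    # Toy illustration of the ALICE idea (hypothetical helpers, simplified):
    # ALICE proper constrains beam search token by token; this version merely
    # re-ranks k sampled continuations against the classifier.
    def adversarial_pick(prompt, sample_continuations, lm_logprob, toxicity,
                         k=16, lam=2.0, want_toxic=True):
        """Pick the continuation that stays fluent yet fools the classifier.

        want_toxic=True: the prompt is hateful, so continuations the
        classifier scores as benign are rewarded (candidate false negatives).
        want_toxic=False: the reverse, for candidate false positives.
        """
        sign = -1.0 if want_toxic else 1.0
        def score(text):
            # generator fluency, pushed against the classifier's verdict
            return lm_logprob(prompt, text) + sign * lam * toxicity(text)
        return max(sample_continuations(prompt, k), key=score)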

Quick Start & Requirements

  • Install: pip install toxigen
  • Data Download: requires a Hugging Face account and passing use_auth_token=True to load_dataset (see the sketch after this list).
  • Data Generation: Requires API keys for language models (e.g., GPT-3) and potentially pre-trained classifiers (e.g., HateBERT).
  • Resources: Includes Jupyter Notebooks for guidance.
  • Links: Hugging Face Dataset, Jupyter Notebook Example
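
A minimal loading sketch: the dataset id skg/toxigen-data and the train/annotated configuration names are assumptions here, so confirm them against the Hugging Face dataset link above before relying on them.

    from datasets import load_dataset

    # Gated dataset: log in first (huggingface-cli login). The dataset id and
    # configuration names are assumptions; confirm them on the dataset card.
    machine_generated = load_dataset("skg/toxigen-data", name="train",
                                     use_auth_token=True)  # ~250k generations
    human_annotated = load_dataset("skg/toxigen-data", name="annotated",
                                   use_auth_token=True)    # human-labeled split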

Highlighted Details

  • Dataset includes 250k training examples and 27,450 human annotations.
  • Provides fine-tuned HateBERT and RoBERTa checkpoints for toxicity detection (usage sketch after this list).
  • Supports generating data for 13 minority groups and encourages community contributions for new groups/scenarios.
  • Focuses on implicit toxicity, aiming to detect hate speech without explicit slurs or profanity.
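
A hedged scoring sketch for the released classifiers via the transformers pipeline; the hub id tomh/toxigen_roberta is an assumption, so check the repository for the exact checkpoint links.

    from transformers import pipeline

    # Hub id is an assumption; the repository links to the official checkpoints.
    toxicity_clf = pipeline("text-classification", model="tomh/toxigen_roberta")
    print(toxicity_clf("an implicitly hostile statement containing no slurs"))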

Maintenance & Community

  • Community contributions are encouraged via pull requests for prompts and demonstrations.
  • Recent contributions include data for the Immigrants and Bisexuality target groups, contributed at a Zurich hackathon.
  • Citation:

    @inproceedings{hartvigsen2022toxigen,
      title={ToxiGen: A Large-Scale Machine-Generated Dataset for Implicit and Adversarial Hate Speech Detection},
      author={Hartvigsen, Thomas and Gabriel, Saadia and Palangi, Hamid and Sap, Maarten and Ray, Dipankar and Kamar, Ece},
      booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
      year={2022}
    }

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The dataset is distributed via Hugging Face.
  • The data and trained checkpoints are intended for research purposes only.

Limitations & Caveats

The dataset focuses on implicit toxicity for 13 specific minority groups and may not capture the full complexity or context-dependent nature of problematic language. The authors note that the dataset can be noisy due to its scale and that further research is needed to include more target groups and scenarios.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Travis Fischer (founder of Agentic), and 2 more.

hh-rlhf by anthropics
Top 0.2% on sourcepulse · 2k stars
RLHF dataset for training safe AI assistants
created 3 years ago · updated 1 month ago