ToxiGen by Microsoft

Dataset and code for generating implicit and adversarial hate speech examples to train and evaluate toxicity detectors

Created 3 years ago
330 stars

Top 82.8% on SourcePulse

Project Summary

ToxiGen provides tools and a large-scale dataset for detecting implicit and adversarial hate speech, particularly statements targeting minority groups. It is designed for researchers and developers building more robust content moderation systems that can identify subtle toxicity even when the text contains no slurs or profanity.

How It Works

The project offers two primary methods for generating toxic and benign sentences: Demonstration-Based Prompting, which feeds human-curated prompts to large language models (LLMs) such as GPT-3, and ALICE, an adversarial decoding scheme that pits a generator LLM against a toxicity classifier to produce examples the classifier gets wrong. Training on these challenging examples iteratively improves classifier performance on subtle hate speech.
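
To make the adversarial setup concrete, below is a minimal sketch of classifier-in-the-loop generation in the spirit of ALICE. The paper's method steers constrained beam search token by token; this sketch only approximates the idea by re-ranking whole samples. The function names, the GPT-2 stand-in generator, the tomh/toxigen_roberta checkpoint id, and the assumption that label index 1 means "toxic" are all illustrative, not the repository's actual API.

    # Sketch of classifier-in-the-loop generation in the spirit of ALICE.
    # All names here are illustrative, not the repo's actual API.
    import torch
    from transformers import (AutoModelForCausalLM,
                              AutoModelForSequenceClassification, AutoTokenizer)

    GEN = "gpt2"                  # stand-in generator; the paper uses GPT-3
    CLF = "tomh/toxigen_roberta"  # assumed id of the authors' released checkpoint

    gen_tok = AutoTokenizer.from_pretrained(GEN)
    gen = AutoModelForCausalLM.from_pretrained(GEN)
    clf_tok = AutoTokenizer.from_pretrained(CLF)
    clf = AutoModelForSequenceClassification.from_pretrained(CLF)

    def toxicity(text: str) -> float:
        # Probability of the "toxic" label (index 1 is an assumption).
        inputs = clf_tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return clf(**inputs).logits.softmax(-1)[0, 1].item()

    def adversarial_sample(prompt: str, k: int = 8) -> str:
        # Sample k continuations of a toxic-demonstration prompt and keep the
        # one the classifier rates LEAST toxic: toxic by construction, yet
        # hard for the detector -- the adversarial pressure ALICE exploits.
        ids = gen_tok(prompt, return_tensors="pt")
        out = gen.generate(**ids, do_sample=True, num_return_sequences=k,
                           max_new_tokens=30, pad_token_id=gen_tok.eos_token_id)
        texts = [gen_tok.decode(o, skip_special_tokens=True) for o in out]
        return min(texts, key=toxicity)

Selecting the continuation the detector rates least toxic, from a prompt built out of toxic demonstrations, yields exactly the kind of hard example the classifier is then retrained on.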

Quick Start & Requirements

  • Install: pip install toxigen
  • Data Download: Requires a Hugging Face account and use_auth_token=True when using load_dataset (see the loading sketch after this list).
  • Data Generation: Requires API keys for language models (e.g., GPT-3) and potentially pre-trained classifiers (e.g., HateBERT).
  • Resources: Includes Jupyter Notebooks for guidance.
  • Links: HuggingFace Dataset, Jupyter Notebook Example
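
A minimal loading sketch for the data-download step above; the dataset id skg/toxigen-data and the configuration names "train" and "annotated" are assumptions here, so verify them against the linked dataset page:

    # Sketch: loading ToxiGen from the Hugging Face Hub. The dataset id and
    # config names are assumptions -- check the linked dataset page.
    from datasets import load_dataset

    tg = load_dataset("skg/toxigen-data", name="train", use_auth_token=True)
    tg_human = load_dataset("skg/toxigen-data", name="annotated", use_auth_token=True)

    print(tg)        # machine-generated statements (~250k training examples)
    print(tg_human)  # the human-annotated subset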

Highlighted Details

  • Dataset includes 250k training examples and 27,450 human annotations.
  • Provides fine-tuned HateBERT and RoBERTa checkpoints for toxicity detection (a usage sketch follows this list).
  • Supports generating data for 13 minority groups and encourages community contributions for new groups/scenarios.
  • Focuses on implicit toxicity, aiming to detect hate speech without explicit slurs or profanity.
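
A hedged usage sketch for the released checkpoints; the model id tomh/toxigen_hatebert is assumed from the authors' Hugging Face releases, and the label names it returns depend on the checkpoint's configuration:

    # Sketch: scoring text with a released toxicity checkpoint via transformers.
    # The model id is an assumption; swap in the id from the repo's README.
    from transformers import pipeline

    detector = pipeline("text-classification", model="tomh/toxigen_hatebert")
    print(detector("Immigrants enrich the communities they join."))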

Maintenance & Community

  • Community contributions are encouraged via pull requests for prompts and demonstrations.
  • Recent contributions include prompts and demonstrations for immigrant and bisexual target groups, added during a Zurich hackathon.
  • Citation:

    @inproceedings{hartvigsen2022toxigen,
      title={ToxiGen: A Large-Scale Machine-Generated Dataset for Implicit and Adversarial Hate Speech Detection},
      author={Hartvigsen, Thomas and Gabriel, Saadia and Palangi, Hamid and Sap, Maarten and Ray, Dipankar and Kamar, Ece},
      booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
      year={2022}
    }

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The dataset is available via HuggingFace.
  • The data and trained checkpoints are intended for research purposes only.

Limitations & Caveats

The dataset focuses on implicit toxicity for 13 specific minority groups and may not capture the full complexity or context-dependent nature of problematic language. The authors note that the dataset can be noisy due to its scale and that further research is needed to include more target groups and scenarios.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 7 more.

Explore Similar Projects

TextAttack by QData

  Python framework for NLP adversarial attacks, data augmentation, and model training
  • 3k stars · Top 0.1% on SourcePulse
  • Created 6 years ago · Updated 2 months ago
  • Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

llm-attacks by llm-attacks

  Attack framework for aligned LLMs, based on a research paper
  • 4k stars · Top 0.2% on SourcePulse
  • Created 2 years ago · Updated 1 year ago