detoxify by unitaryai

Trained models for toxic comment classification

Created 5 years ago
1,107 stars

Top 34.5% on SourcePulse

View on GitHub
Project Summary

Detoxify provides pre-trained models and code for classifying toxic comments across multiple datasets and languages. It is designed for researchers and developers working on content moderation, bias detection, and natural language understanding, offering a user-friendly interface to identify various forms of toxicity in text.

How It Works

The library leverages state-of-the-art transformer models (BERT, RoBERTa, XLM-RoBERTa) fine-tuned on Jigsaw's toxic comment datasets. It employs PyTorch Lightning for efficient training and Hugging Face Transformers for model architecture and tokenization. This approach allows for high performance and broad language support, with specific models optimized for general toxicity, unintended bias, and multilingual classification.
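
A minimal prediction sketch, assuming the pip package's documented Detoxify class ('original', 'unbiased', and 'multilingual' select the three model variants; exact label names vary by model):

    from detoxify import Detoxify

    # Load the general toxicity model; 'unbiased' and 'multilingual'
    # select the other two variants.
    model = Detoxify('original')

    # predict() accepts a single string or a list of strings and returns
    # a dict mapping each toxicity label to a score in [0, 1].
    results = model.predict('you are a wonderful person')
    print(results)  # e.g. {'toxicity': 0.0007, 'insult': 0.0002, ...}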

Quick Start & Requirements

  • Install via pip: pip install detoxify
  • Inference requires PyTorch and Hugging Face Transformers; training additionally requires the Kaggle API (to download the Jigsaw datasets) and pandas.
  • Models can be loaded directly from PyTorch Hub or from local checkpoints; a loading sketch follows this list.
  • Supports CPU and CUDA devices.
  • Example usage and detailed prediction/training scripts are available.
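
A hedged sketch of device selection and PyTorch Hub loading; the device argument and the 'toxic_bert' hub entry point follow the README, but verify them against the current repository:

    import torch
    from detoxify import Detoxify

    # Run on GPU when available; Detoxify accepts a device argument.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = Detoxify('original', device=device)

    # Alternatively, fetch the underlying model straight from PyTorch Hub.
    hub_model = torch.hub.load('unitaryai/detoxify', 'toxic_bert')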

Highlighted Details

  • Achieved high AUC scores on Jigsaw challenges (e.g., 93.74% for unbiased, 92.11% for multilingual).
  • Offers smaller, lightweight models (e.g., original-small, unbiased-small) for reduced resource usage.
  • Multilingual model supports English, French, Spanish, Italian, Portuguese, Turkish, and Russian (see the sketch after this list).
  • Includes detailed explanations of toxicity labels and ethical considerations regarding potential biases.
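
For example, the multilingual variant scores mixed-language batches in a single call (a sketch; the example sentences are illustrative):

    from detoxify import Detoxify

    # XLM-RoBERTa-based model covering the seven languages listed above.
    model = Detoxify('multilingual')

    # Batched input: one English and one French comment.
    results = model.predict(['you are stupid', 'tu es stupide'])

    # With batched input, each label maps to a list of scores, one per text.
    print(results['toxicity'])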

Maintenance & Community

  • Developed by Laura Hanu at Unitary.
  • Most recent README-documented updates date to October 2021; commit activity continues (see Health Check below).
  • Codebase includes CI/CD pipelines for testing and linting.

Licensing & Compatibility

  • The README does not explicitly state a license; check the repository's LICENSE file before commercial use or closed-source integration.

Limitations & Caveats

  • Models may misclassify humorous or self-deprecating use of profanity as toxic.
  • Potential biases towards vulnerable minority groups exist, as noted by the developers.
  • Intended for research or to assist human content moderators; fine-tuning on a task-specific dataset is recommended before deployment.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 18 stars in the last 30 days
Starred by Aravind Srinivas (Cofounder of Perplexity), François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize), and 42 more.
