Korean hate speech dataset for multi-label classification
This repository provides Korean UnSmile, a multi-label, human-annotated dataset for identifying hate speech in Korean text. It is intended for NLP researchers and developers who need data to train and evaluate models that detect online hate speech.
How It Works
The dataset defines hate speech as hostile remarks, mockery, or prejudice directed at specific social groups. Hate speech is divided into eight categories: gender/family, male, sexual minorities, race/nationality, age, region, religion, and other hate. Two further labels cover general profanity/abuse and clean text. A sentence can carry more than one label; labels are assigned by a team of annotators and then reviewed by hate speech experts.
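To make the multi-label idea concrete, the sketch below encodes a sentence's labels as a binary vector with one slot per category and decodes it back. The English names in `LABELS` simply mirror the category list above and are an assumption for illustration; the dataset's actual column names may differ (for example, they may be in Korean). The helpers `encode` and `decode` are hypothetical and not part of the repository.

```python
from typing import List

# Assumed label names, copied from the category list in this summary.
# The real dataset columns may use different (possibly Korean) names.
LABELS: List[str] = [
    "gender/family",
    "male",
    "sexual minorities",
    "race/nationality",
    "age",
    "region",
    "religion",
    "other hate",
    "profanity/abuse",
    "clean",
]


def encode(active: List[str]) -> List[int]:
    """Turn a list of label names into a 0/1 vector ordered like LABELS."""
    unknown = set(active) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in LABELS]


def decode(vector: List[int]) -> List[str]:
    """Turn a 0/1 vector back into the names of the labels that are set."""
    if len(vector) != len(LABELS):
        raise ValueError("vector length must match the number of labels")
    return [name for name, flag in zip(LABELS, vector) if flag == 1]


# A sentence tagged with two categories at once.
example_vector = encode(["male", "age"])
example_labels = decode(example_vector)
```

Because a sentence can belong to several categories, a model trained on this kind of target typically predicts each slot independently rather than picking a single class.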
Quick Start & Requirements
Install the Hugging Face datasets library: pip install datasets
Load the dataset in Python:
from datasets import load_dataset
dataset = load_dataset('smilegate-ai/kor_unsmile')
Highlighted Details
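After loading, a quick look at how often each label occurs is a useful first step. The sketch below counts positive labels over an iterable of rows, assuming each row behaves like a dictionary that maps a label column to 0 or 1. That row shape, the toy column names, and the helper `count_positive_labels` are assumptions for illustration, so the code runs on toy data without downloading anything.

```python
from typing import Dict, Iterable, List, Mapping


def count_positive_labels(
    rows: Iterable[Mapping[str, int]], label_columns: List[str]
) -> Dict[str, int]:
    """Count how many rows have each label column set to 1."""
    counts = {name: 0 for name in label_columns}
    for row in rows:
        for name in label_columns:
            if row.get(name) == 1:
                counts[name] += 1
    return counts


# Toy rows standing in for dataset records (column names are assumed).
toy_rows = [
    {"text": "a", "male": 1, "age": 1, "clean": 0},
    {"text": "b", "male": 0, "age": 0, "clean": 1},
    {"text": "c", "male": 1, "age": 0, "clean": 0},
]
toy_counts = count_positive_labels(toy_rows, ["male", "age", "clean"])

# With the real data, one might first inspect the splits and columns:
# from datasets import load_dataset
# dataset = load_dataset('smilegate-ai/kor_unsmile')
# print(dataset)  # lists the available splits and their column names
```

Printing the loaded object before indexing into it avoids guessing split or column names, which this summary does not list.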
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The annotations are not guaranteed to be 100% accurate, and users are encouraged to report tagging errors. The CC-BY-NC-ND 4.0 license prohibits commercial use and the distribution of derivative works unless the licensor grants permission.
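One way to look for candidate tagging errors is a simple consistency check. The sketch below flags rows that are marked clean while also carrying another label, and rows with no label at all. Treating these patterns as suspicious is an assumption about the labeling scheme, not a rule stated by the dataset authors, and `find_suspect_rows` and the column names are hypothetical.

```python
from typing import Iterable, List, Mapping, Tuple


def find_suspect_rows(
    rows: Iterable[Mapping[str, int]],
    label_columns: List[str],
    clean_column: str = "clean",
) -> List[Tuple[int, str]]:
    """Return (row index, reason) pairs for rows whose labels look inconsistent."""
    suspects = []
    for index, row in enumerate(rows):
        # True if any non-clean label is set on this row.
        has_other = any(
            row.get(name, 0) == 1 for name in label_columns if name != clean_column
        )
        is_clean = row.get(clean_column, 0) == 1
        if is_clean and has_other:
            suspects.append((index, "clean plus another label"))
        elif not is_clean and not has_other:
            suspects.append((index, "no label set"))
    return suspects


# Toy rows with assumed column names; rows 1 and 2 should be flagged.
check_rows = [
    {"male": 1, "age": 0, "clean": 0},
    {"male": 1, "age": 0, "clean": 1},
    {"male": 0, "age": 0, "clean": 0},
    {"male": 0, "age": 0, "clean": 1},
]
suspect_rows = find_suspect_rows(check_rows, ["male", "age", "clean"])
```

Flagged rows are only candidates for manual review; a human should confirm each one before reporting it.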