smilegate-ai: Korean hate speech dataset for multi-label classification
This repository provides the Korean UnSmile dataset, a multi-label, human-annotated dataset for identifying hate speech in Korean text. It is aimed at researchers and developers working on natural language processing, particularly the detection and mitigation of online hate speech, and can be used to train and evaluate models.
How It Works
The dataset defines hate speech as hostile remarks, mockery, or prejudice against specific social groups. It categorizes hate speech into distinct types, including gender/family, male, sexual minorities, race/nationality, age, region, religion, and other categories, alongside general profanity/abuse and clean text. Each sentence is multi-labeled by a team of annotators and reviewed by hate speech experts, ensuring a nuanced classification.
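Because a sentence can carry several categories at once, each example is naturally represented as a multi-hot vector rather than a single class index. The sketch below illustrates that idea only. The English label names are taken from the category list above and are an assumption; the dataset's actual column names and ordering may differ.

```python
# Illustrative English label names (assumed); the dataset's real columns may differ.
LABELS = [
    "gender/family", "male", "sexual minorities", "race/nationality",
    "age", "region", "religion", "other hate",
    "profanity/abuse", "clean",
]


def encode(tags):
    """Return a multi-hot vector with 1 for every label present in `tags`."""
    unknown = set(tags) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in tags else 0 for label in LABELS]


def decode(vector):
    """Return the label names whose position in `vector` is set."""
    return [label for label, flag in zip(LABELS, vector) if flag]


# A sentence tagged with two categories at the same time.
example = encode({"region", "profanity/abuse"})
```

A multi-label classifier trained on such targets would typically use one independent sigmoid output per label instead of a single softmax over all ten.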
Quick Start & Requirements
Install the datasets library: pip install datasets
Load the dataset:
from datasets import load_dataset
dataset = load_dataset('smilegate-ai/kor_unsmile')
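Once the data is loaded, a common first step is to check how often each label occurs. The following is a minimal sketch, assuming each record is a dict with one 0/1 column per label. The column names and the toy rows are hypothetical placeholders, not values taken from the dataset.

```python
from collections import Counter


def label_counts(rows, label_columns):
    """Count how many rows have each label column set to a truthy value."""
    counts = Counter({name: 0 for name in label_columns})
    for row in rows:
        for name in label_columns:
            if row.get(name):
                counts[name] += 1
    return counts


# Placeholder rows standing in for dataset records; not real data.
toy_rows = [
    {"text": "placeholder sentence 1", "region": 1, "profanity/abuse": 1, "clean": 0},
    {"text": "placeholder sentence 2", "region": 0, "profanity/abuse": 0, "clean": 1},
    {"text": "placeholder sentence 3", "region": 0, "profanity/abuse": 1, "clean": 0},
]

counts = label_counts(toy_rows, ["region", "profanity/abuse", "clean"])
```

Iterating over a split from the datasets library yields one dict per record, so the same function could be applied to a loaded split, provided the label columns are stored as 0/1 flags. Inspect the loaded object first to confirm the split and column names.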
Requires the datasets library.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The dataset's annotations are not guaranteed to be 100% accurate, and users are encouraged to report any tagging errors. The CC-BY-NC-ND 4.0 license prohibits commercial use and derivative works without permission.