korean_unsmile_dataset by smilegate-ai

Korean hate speech dataset for multi-label classification

created 3 years ago

441 stars

Project Summary

This repository provides Korean UnSmile, a multi-label, human-annotated dataset for identifying hate speech in Korean text. It is intended for researchers and developers working on natural language processing, particularly the detection and mitigation of online hate speech, and can be used to train and evaluate classification models.

How It Works

The dataset defines hate speech as hostile remarks, mockery, or prejudice directed at specific social groups. Hate speech is divided into eight categories: gender/family, male, sexual minorities, race/nationality, age, region, religion, and a catch-all "other" category. Two further labels cover general profanity/abuse and clean text. A sentence can carry more than one label. Each sentence is labeled by a team of annotators and then reviewed by hate speech experts.
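The labeling scheme described above can be pictured as a multi-hot vector with one slot per category. The sketch below is illustrative only: the English label names and the helpers `encode` and `decode` are assumptions made for this example, not the dataset's actual column names or API.

```python
# Illustrative English names for the categories described in the text.
# Assumption: the real dataset may use different column names.
LABELS = [
    "gender/family", "male", "sexual minorities", "race/nationality",
    "age", "region", "religion", "other", "profanity/abuse", "clean",
]


def encode(active):
    """Return a 0/1 vector with a 1 for every label name in `active`."""
    unknown = set(active) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in LABELS]


def decode(vector):
    """Return the label names whose slot in `vector` is set."""
    return [name for name, flag in zip(LABELS, vector) if flag]


# A hypothetical sentence tagged with two categories at once.
example = encode({"region", "religion"})
print(example)
print(decode(example))
```

Keeping one slot per category lets a single sentence belong to several categories at the same time, which a single-class label could not express.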

Quick Start & Requirements

Install the Hugging Face datasets library: pip install datasets

Load the dataset:

from datasets import load_dataset
dataset = load_dataset('smilegate-ai/kor_unsmile')

Requires Python and the datasets library.
Baseline models are available for testing and training, compatible with Hugging Face Transformers.
Official documentation and tutorials are available on the GitHub repository.
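A multi-label classifier typically emits one raw score (logit) per category, and each score is thresholded independently of the others. The following sketch shows that post-processing step in plain Python. The 0.5 threshold, the label names, and the function `predict_labels` are assumptions for illustration, not the documented behavior of the baseline models.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose sigmoid score reaches `threshold`.

    Each label is decided on its own, so zero, one, or several
    labels can be returned for the same sentence.
    """
    if len(logits) != len(label_names):
        raise ValueError("logits and label_names must have the same length")
    return [
        name
        for name, z in zip(label_names, logits)
        if sigmoid(z) >= threshold
    ]


# Made-up logits for three hypothetical labels.
names = ["region", "religion", "clean"]
print(predict_labels([2.0, -1.5, -3.0], names))
```

In practice the logits would come from a model's output layer, and the threshold may be tuned per category rather than fixed at 0.5.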

Highlighted Details

Contains 18,742 total sentences, with 10,139 labeled as hate speech.
Features multi-label classification for hate speech categories.
Includes a baseline BERT model for sequence classification, demonstrating inference capabilities.
Provides detailed performance metrics (precision, recall, F1-score) for the baseline model across all categories.
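Per-category precision, recall, and F1-score can be computed from multi-hot label vectors by treating each category as its own binary problem. The sketch below uses the standard definitions on made-up toy data. The function name `per_label_prf` and the sample vectors are hypothetical and do not reproduce the repository's reported numbers.

```python
def per_label_prf(y_true, y_pred, label_names):
    """Compute (precision, recall, F1) for each label column.

    y_true and y_pred are lists of 0/1 vectors, one vector per sentence.
    """
    results = {}
    for j, name in enumerate(label_names):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        # Fall back to 0.0 when a denominator would be zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[name] = (precision, recall, f1)
    return results


# Toy data: four sentences, two hypothetical labels.
y_true = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 1]]

for name, (p, r, f1) in per_label_prf(y_true, y_pred, ["region", "clean"]).items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting the metrics per category, rather than as one aggregate, makes it visible when a model does well on frequent categories but poorly on rare ones.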

Maintenance & Community

Developed and produced by Smilegate AI.
Data tagging and review conducted by Underscore, a data-driven knowledge content startup.
Citations provided for both the dataset and related research papers.
A dataset introduction video is available.

Licensing & Compatibility

Source code and baseline models are released under the Apache 2.0 license.
The dataset itself is licensed under CC-BY-NC-ND 4.0.
Commercial use of the dataset requires explicit inquiry to Smilegate AI.

Limitations & Caveats

Annotation accuracy is not guaranteed, and users are encouraged to report tagging errors. The CC-BY-NC-ND 4.0 license prohibits commercial use and the distribution of derivative works without permission.

Health Check

Last commit

3 years ago

Responsiveness

1 day

Star History

10 stars in the last 90 days