korean_unsmile_dataset  by smilegate-ai

Korean hate speech dataset for multi-label classification

created 3 years ago
441 stars

Top 69.1% on sourcepulse

Project Summary

This repository provides the Korean UnSmile dataset, a multi-label, human-annotated dataset for identifying hate speech in Korean text. It is intended for researchers and developers who train and evaluate natural language processing models for detecting and mitigating online hate speech.

How It Works

The dataset defines hate speech as hostile remarks, mockery, or prejudice directed at specific social groups. Hate speech is divided into eight categories: women/family, male, sexual minorities, race/nationality, age, region, religion, and other hate speech. Two further labels cover general profanity/abuse and clean text. Each sentence is multi-labeled by a team of annotators and then reviewed by hate speech experts.
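
As a rough illustration of the multi-label scheme described above, the sketch below encodes one sentence's annotations as a multi-hot vector. The English label names and the `encode_labels` helper are assumptions made for this example; the dataset's own column names may differ.

```python
# Hypothetical English names for the ten labels described above;
# the dataset's actual column names may differ.
LABELS = [
    "women/family", "male", "sexual minorities", "race/nationality",
    "age", "region", "religion", "other hate", "profanity/abuse", "clean",
]


def encode_labels(active):
    """Return a multi-hot vector with 1 at the position of each active label."""
    unknown = set(active) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in LABELS]


# A single sentence can carry several hate speech categories at once.
vector = encode_labels({"male", "age"})
print(vector)
```

Unlike single-label classification, more than one position in the vector may be set for the same sentence.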

Quick Start & Requirements

  • Install the Hugging Face datasets library: pip install datasets
  • Load the dataset:
    from datasets import load_dataset
    dataset = load_dataset('smilegate-ai/kor_unsmile')
    
  • Requires Python and the datasets library.
  • Baseline models are available for testing and training, compatible with Hugging Face Transformers.
  • Official documentation and tutorials are available in the GitHub repository.
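
A multi-label classifier typically emits one independent score per category rather than a single softmax distribution. The sketch below shows one common way to turn such scores into labels, assuming raw logits, a sigmoid activation, and a 0.5 threshold. The label names, the `decode_logits` helper, and the threshold are illustrative assumptions, not the baseline model's documented behavior.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_logits(logits, labels, threshold=0.5):
    """Return every label whose sigmoid score reaches the threshold."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    return [name for name, z in zip(labels, logits) if sigmoid(z) >= threshold]


# Hypothetical logits for a three-label toy problem.
example_labels = ["male", "age", "clean"]
example_logits = [2.0, 0.3, -3.0]
predicted = decode_logits(example_logits, example_labels)
print(predicted)
```

Because each label is thresholded on its own, the result may contain several labels or none; a fallback for the empty case is left to the caller.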

Highlighted Details

  • Contains 18,742 total sentences, with 10,139 labeled as hate speech.
  • Features multi-label classification for hate speech categories.
  • Includes a baseline BERT sequence classification model that can be used for inference.
  • Provides detailed performance metrics (precision, recall, F1-score) for the baseline model across all categories.
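
Per-category precision, recall, and F1 for multi-label predictions can be computed by treating each label column as its own binary problem. The following is a minimal sketch on made-up toy data; the `per_label_metrics` helper and the label names are hypothetical, and the numbers are not the baseline model's reported results.

```python
def per_label_metrics(y_true, y_pred, labels):
    """Compute precision, recall, and F1 for each label column."""
    results = {}
    for j, name in enumerate(labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        # Fall back to 0.0 when a denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[name] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Toy multi-hot rows: one row per sentence, one column per label.
toy_labels = ["hate", "clean"]
toy_true = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]
metrics = per_label_metrics(toy_true, toy_pred, toy_labels)
print(metrics)
```

In practice a library routine such as scikit-learn's `classification_report` would usually be preferred; the hand-written version is shown only to make the per-column counting explicit.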

Maintenance & Community

  • Developed and produced by Smilegate AI.
  • Data tagging and review conducted by Underscore, a data-driven knowledge content startup.
  • Citations provided for both the dataset and related research papers.
  • A dataset introduction video is available.

Licensing & Compatibility

  • Source code and baseline models are released under the Apache 2.0 license.
  • The dataset itself is licensed under CC-BY-NC-ND 4.0.
  • Commercial use of the dataset requires contacting Smilegate AI for permission.

Limitations & Caveats

The annotations are not guaranteed to be fully accurate, and users are encouraged to report tagging errors. The dataset's CC-BY-NC-ND 4.0 license prohibits commercial use and the distribution of derivative works without permission.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 10 stars in the last 90 days
