AwesomeKorean_Data by songys

Korean NLP Datasets

Created 6 years ago

921 stars

Top 38.8% on SourcePulse

Project Summary

This repository curates and organizes links to Korean language datasets, primarily aimed at researchers and developers building end-to-end NLP models. It serves as a centralized resource to simplify data acquisition and exploration for various Korean NLP tasks, from morphological analysis to machine translation and sentiment analysis.

How It Works

The project compiles links to a wide array of Korean text and speech datasets, categorized by task (e.g., named entity recognition, question answering, summarization). Each entry lists the dataset's provider, documentation, license, and redistribution terms, along with its data volume and language. The goal is to let users quickly identify and download relevant data for their NLP projects.

Quick Start & Requirements

Installation: No specific installation is required as it's a curated list of links. Users need to visit the provided links to access and download the datasets.
Prerequisites: Access to the internet is required. Some datasets may require registration, agreement to terms of use, or specific software for downloading or processing.
Links:
- Main repository: https://github.com/songys/AwesomeKorean_Data
- English version: https://github.com/songys/Awesome-Korean-NLP
- Preprocessing and downloader links: https://ratsgo.github.io/embedding/preprocess.html
- Hugging Face Korean Datasets: https://github.com/songys/huggingface_KoreanDataset

Highlighted Details

- Comprehensive coverage of Korean NLP tasks, including morphology, syntax, NER, sentiment analysis, machine translation, and more.
- Detailed metadata for each dataset, such as license type (Commercial, Academic, Unknown), redistribution rights, and data volume.
- Datasets drawn from various sources such as KLUE, KoBEST, KAIST, AIHub, and the National Institute of Korean Language.

Maintenance & Community

The repository's revision notes record significant updates in August 2020 and a move to the main repo in October 2020. It appears to be a community-driven effort to consolidate Korean language resources.

Licensing & Compatibility

License terms vary by dataset. Redistribution rights are marked with short codes: "rd" (redistribution possible, with or without modification), "no" (redistribution not possible), and "unk" (unknown). Users must check the specific license for each dataset to ensure compatibility with their intended use, especially for commercial applications.
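
As a rough illustration of how these codes might be used when shortlisting data, the sketch below models a few catalog entries and keeps only those marked "rd", optionally narrowed to one license type. The repository itself is a list of links and ships no such code or API; the `DatasetEntry` class, the `redistributable` helper, and every dataset name here are hypothetical placeholders, and only the code values and license types come from the summary above.

```python
from dataclasses import dataclass

# Redistribution codes as described in the summary above.
REDISTRIBUTION_CODES = {
    "rd": "redistribution possible, with or without modification",
    "no": "redistribution not possible",
    "unk": "unknown",
}


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical record for one row of the curated list."""
    name: str            # placeholder names, not real entries
    task: str
    license_type: str    # "Commercial", "Academic", or "Unknown"
    redistribution: str  # one of the keys in REDISTRIBUTION_CODES


def redistributable(entries, license_type=None):
    """Return entries marked "rd", optionally limited to one license type."""
    selected = []
    for entry in entries:
        if entry.redistribution not in REDISTRIBUTION_CODES:
            raise ValueError(
                f"unknown redistribution code: {entry.redistribution!r}"
            )
        if entry.redistribution != "rd":
            continue
        if license_type is not None and entry.license_type != license_type:
            continue
        selected.append(entry)
    return selected


# Made-up catalog used only to exercise the filter.
catalog = [
    DatasetEntry("example-ner", "named entity recognition", "Commercial", "rd"),
    DatasetEntry("example-qa", "question answering", "Academic", "rd"),
    DatasetEntry("example-mt", "machine translation", "Academic", "no"),
    DatasetEntry("example-sum", "summarization", "Unknown", "unk"),
]

for entry in redistributable(catalog, license_type="Commercial"):
    print(entry.name, "-", entry.task)
```

A filter like this only narrows the candidates; an "rd" mark is not a substitute for reading the dataset's actual license text before commercial use.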

Limitations & Caveats

Some datasets may have specific usage restrictions or require a formal application process, as indicated by terms like "academic use only" or the need for user registration and approval. The availability and format of data can also vary, with some links potentially leading to external sites requiring further steps.
