AwesomeKorean_Data by songys

Korean NLP Datasets

Created 5 years ago
891 stars

Top 40.7% on SourcePulse

Project Summary

This repository curates and organizes links to Korean language datasets, primarily aimed at researchers and developers building end-to-end NLP models. It serves as a centralized resource to simplify data acquisition and exploration for various Korean NLP tasks, from morphological analysis to machine translation and sentiment analysis.

How It Works

The project compiles links to a wide array of Korean text and speech datasets, categorizing them by task (e.g., named entity recognition, question answering, summarization) and providing details on their provider, documentation, license, and redistribution terms. It also includes information on data volume and language. The repository aims to facilitate easier access to these resources, enabling users to quickly identify and download relevant data for their NLP projects.
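The task-based organization described above can be pictured with a small sketch. The entry fields (`name`, `task`, `provider`) and the sample values are hypothetical placeholders for illustration, not the repository's actual schema or contents.

```python
from collections import defaultdict

# Hypothetical entries; names and fields are illustrative only,
# not taken from the repository.
ENTRIES = [
    {"name": "example-ner-corpus", "task": "named entity recognition", "provider": "Provider A"},
    {"name": "example-qa-set", "task": "question answering", "provider": "Provider B"},
    {"name": "example-ner-news", "task": "named entity recognition", "provider": "Provider C"},
]


def group_by_task(entries):
    """Map each task name to the list of dataset names filed under it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["task"]].append(entry["name"])
    return dict(grouped)


by_task = group_by_task(ENTRIES)
```

Under these assumptions, `by_task["named entity recognition"]` lists both NER placeholders, which mirrors how a reader would scan one task category for candidate datasets.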

Quick Start & Requirements

Highlighted Details

  • Comprehensive coverage of Korean NLP tasks, including morphology, syntax, NER, sentiment analysis, machine translation, and more.
  • Detailed metadata for each dataset, such as license type (Commercial, Academic, Unknown), redistribution rights, and data volume.
  • Inclusion of datasets from various sources like KLUE, KoBEST, KAIST, AIHub, and the National Institute of Korean Language.

Maintenance & Community

The repository's history notes significant updates in August 2020 and a move to the main repo in October 2020. It appears to be a community-driven effort to consolidate Korean language resources.

Licensing & Compatibility

Dataset licenses vary, including "rd" (Redistribution possible with or without modification), "no" (Redistribution not possible), and "unk" (Unknown). Users must check the specific license for each dataset to ensure compatibility with their intended use, especially for commercial applications.
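As a minimal sketch of how the redistribution codes might be applied when shortlisting datasets, the following filters a catalog down to entries explicitly marked "rd". The `DatasetEntry` class, its field names, and the sample entries are assumptions made for illustration; only the three codes and their meanings come from the text above.

```python
from dataclasses import dataclass

# Codes and meanings as described in the text above.
REDISTRIBUTION_CODES = {
    "rd": "Redistribution possible with or without modification",
    "no": "Redistribution not possible",
    "unk": "Unknown",
}


@dataclass(frozen=True)
class DatasetEntry:
    name: str            # hypothetical dataset name
    license_type: str    # e.g. "Commercial", "Academic", "Unknown"
    redistribution: str  # one of "rd", "no", "unk"


def redistributable(entries):
    """Keep only entries explicitly marked "rd"; reject unrecognized codes."""
    for entry in entries:
        if entry.redistribution not in REDISTRIBUTION_CODES:
            raise ValueError(f"unrecognized redistribution code: {entry.redistribution!r}")
    return [entry for entry in entries if entry.redistribution == "rd"]


# Hypothetical catalog used only to exercise the filter.
catalog = [
    DatasetEntry("example-a", "Commercial", "rd"),
    DatasetEntry("example-b", "Academic", "no"),
    DatasetEntry("example-c", "Unknown", "unk"),
]
shortlist = [entry.name for entry in redistributable(catalog)]
```

Treating "unk" the same as "no" is a deliberately conservative choice in this sketch: an unknown status is not permission, so those datasets still need a manual license check.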

Limitations & Caveats

Some datasets may have specific usage restrictions or require a formal application process, as indicated by terms like "academic use only" or the need for user registration and approval. The availability and format of data can also vary, with some links potentially leading to external sites requiring further steps.

Health Check
Last Commit: 11 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 1 star in the last 30 days
