open-korean-instructions by HeegyuKim

Korean instruction datasets for language model training

Created 3 years ago

468 stars

Top 64.2% on SourcePulse

Project Summary

This repository aggregates publicly available Korean instruction datasets for training language models. It serves as a central hub for researchers and developers working on Korean NLP, offering a diverse collection of datasets generated through translation, AI augmentation, and human curation, facilitating the development of more capable Korean language models.

How It Works

The repository curates and lists various Korean instruction datasets, categorizing them by size, type (singleton/multi-turn), and content origin. Datasets are primarily sourced from English counterparts translated into Korean using tools like DeepL, Google Translate, or AI models, with some datasets generated directly from Korean sources or through AI-assisted augmentation. This approach leverages existing high-quality English datasets and adapts them for Korean, while also incorporating natively Korean-generated data.

Quick Start & Requirements

Datasets are available for download via Hugging Face.
No specific installation commands are provided; users download and utilize the datasets directly.
Requirements are standard for NLP tasks: Python environment and libraries for data handling.

Highlighted Details

Extensive collection of over 30 diverse Korean instruction datasets.
Includes datasets for various tasks: general instruction following, question answering, conversational AI, and specialized domains like medical and financial.
Features datasets specifically designed for Reinforcement Learning from Human Feedback (RLHF) and Reward Modeling (RM).
Provides links to evaluation benchmarks and leaderboards for Korean language models.

Maintenance & Community

The repository encourages community contributions via Pull Requests for new datasets.
Links to related projects and translation efforts are provided.

Licensing & Compatibility

Licenses vary per dataset; users must check individual dataset licenses.
Compatibility for commercial use depends on the specific dataset's license.

Limitations & Caveats

Dataset quality and licensing vary significantly across the collection, requiring careful individual review.
Some datasets are translations, which may introduce nuances or inaccuracies.
The repository is a curated list, not a unified, ready-to-use dataset; users must manage individual downloads and preprocessing.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)

0

Issues (30d)

0

Star History

2 stars in the last 30 days

Explore Similar Projects

Starred by

Binyuan Hui

Binyuan Hui(Research Scientist at Alibaba Qwen),

Luca Soldaini

Luca Soldaini(Research Scientist at Ai2), and

1 more.

instruction-datasets by raunak-agarwal

Dataset list for instruction tuning of LLMs

Created 3 years ago

Updated 2 years ago

InstructionZoo by FreedomIntelligence

Instruction-tuning dataset collection for chat-based LLMs

Created 3 years ago

Updated 2 years ago

KoLLaVA by tabtoyou

Multimodal model for Korean visual instruction following

Created 3 years ago

Updated 1 year ago

sft_datasets by chaoswork

SFT datasets for instruction tuning

Created 3 years ago

Updated 3 years ago

Starred by

Pawel Garbacki

Pawel Garbacki(Cofounder of Fireworks AI).

awesome-instruction-datasets by jianzhnie

Curated list of instruction datasets for training ChatLLMs

Created 3 years ago

Updated 3 weeks ago

Starred by

Alexander Borzunov

Alexander Borzunov(Research Scientist at OpenAI) and

Philipp Schmid

Philipp Schmid(DevRel at Google DeepMind).

xmtf by bigscience-workshop

Code and data for crosslingual multitask finetuning research

Created 3 years ago

Updated 1 year ago

Starred by

Eugene Yan

Eugene Yan(AI Scientist at AWS) and

Omar Sanseviero

Omar Sanseviero(DevRel at Google DeepMind).

awesome-instruction-dataset by yaodongC

Dataset collection for instruction-tuning LLMs

Created 3 years ago

Updated 2 years ago

AwesomeKorean_Data by songys

Korean NLP Datasets

Created 6 years ago

Updated 3 weeks ago

Starred by

Jesse Clark

Jesse Clark(Cofounder of Marqo) and

Elvis Saravia

Elvis Saravia(Founder of DAIR.AI).

TextBox by RUCAIBox

Text generation library with pre-trained language models

Created 5 years ago

Updated 3 years ago

KoELECTRA by monologg

Pretrained ELECTRA model for Korean language tasks

Created 6 years ago

Updated 2 years ago

Awesome-LLMs-Datasets by lmmlzn

LLM datasets survey for pre-training, fine-tuning, preference, evaluation, and NLP

Created 2 years ago

Updated 4 months ago

Starred by

Yaowei Zheng

Yaowei Zheng(Author of LLaMA-Factory),

Shizhe Diao

Shizhe Diao(Author of LMFlow; Research Scientist at NVIDIA), and

1 more.

nlp_chinese_corpus by brightmart

Chinese NLP corpus for pre-training and language model tasks

Created 7 years ago

Updated 5 months ago

Feedback? Help us improve.