Korean instruction datasets for language model training
Top 68.5% on SourcePulse
This repository aggregates publicly available Korean instruction datasets for training language models. It is a central hub for researchers and developers working on Korean NLP, collecting datasets produced by translation, AI augmentation, and human curation to support the development of more capable Korean language models.
How It Works
The repository curates and lists Korean instruction datasets, categorizing them by size, type (single-turn or multi-turn), and content origin. Most are English datasets translated into Korean with DeepL, Google Translate, or AI models; others are built directly from Korean sources or through AI-assisted augmentation. This approach reuses existing high-quality English datasets while also incorporating natively Korean data.
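The categorization described above (size, turn structure, content origin) can be pictured as a small catalog that is filtered by those attributes. The following is a minimal sketch only: the class names, field names, and the `filter_entries` helper are assumptions made for illustration, and the entries are placeholders, not real datasets or sizes from the repository.

```python
from dataclasses import dataclass
from enum import Enum


class TurnType(Enum):
    SINGLE_TURN = "single-turn"
    MULTI_TURN = "multi-turn"


class Origin(Enum):
    TRANSLATED = "translated"        # English source translated into Korean
    NATIVE = "native"                # built directly from Korean sources
    AI_AUGMENTED = "ai-augmented"    # produced or expanded with an AI model


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog row (hypothetical schema, not the repository's own)."""
    name: str
    num_examples: int
    turn_type: TurnType
    origin: Origin


def filter_entries(entries, turn_type=None, origin=None, min_examples=0):
    """Return entries that match every criterion that was supplied."""
    return [
        e for e in entries
        if (turn_type is None or e.turn_type == turn_type)
        and (origin is None or e.origin == origin)
        and e.num_examples >= min_examples
    ]


# Illustrative placeholder data, not real datasets or sizes.
CATALOG = [
    DatasetEntry("example-translated-qa", 50_000, TurnType.SINGLE_TURN, Origin.TRANSLATED),
    DatasetEntry("example-native-chat", 12_000, TurnType.MULTI_TURN, Origin.NATIVE),
    DatasetEntry("example-augmented-chat", 30_000, TurnType.MULTI_TURN, Origin.AI_AUGMENTED),
]

# Select multi-turn datasets with at least 20,000 examples.
multi_turn = filter_entries(CATALOG, turn_type=TurnType.MULTI_TURN, min_examples=20_000)
print([e.name for e in multi_turn])  # prints ['example-augmented-chat']
```

Keeping turn structure and origin as separate fields lets the two be combined freely, for example selecting only translated single-turn data for one training run.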
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
5 months ago
Inactive