This repository aggregates publicly available Korean instruction datasets for training language models. It serves as a central hub for researchers and developers working on Korean NLP, offering a diverse collection of datasets generated through translation, AI augmentation, and human curation, facilitating the development of more capable Korean language models.
How It Works
The repository curates and lists various Korean instruction datasets, categorizing them by size, type (singleton/multi-turn), and content origin. Datasets are primarily sourced from English counterparts translated into Korean using tools like DeepL, Google Translate, or AI models, with some datasets generated directly from Korean sources or through AI-assisted augmentation. This approach leverages existing high-quality English datasets and adapts them for Korean, while also incorporating natively Korean-generated data.
Quick Start & Requirements
- Datasets are available for download via Hugging Face.
- No specific installation commands are provided; users download and utilize the datasets directly.
- Requirements are standard for NLP tasks: Python environment and libraries for data handling.
Highlighted Details
- Extensive collection of over 30 diverse Korean instruction datasets.
- Includes datasets for various tasks: general instruction following, question answering, conversational AI, and specialized domains like medical and financial.
- Features datasets specifically designed for Reinforcement Learning from Human Feedback (RLHF) and Reward Modeling (RM).
- Provides links to evaluation benchmarks and leaderboards for Korean language models.
Maintenance & Community
- The repository encourages community contributions via Pull Requests for new datasets.
- Links to related projects and translation efforts are provided.
Licensing & Compatibility
- Licenses vary per dataset; users must check individual dataset licenses.
- Compatibility for commercial use depends on the specific dataset's license.
Limitations & Caveats
- Dataset quality and licensing vary significantly across the collection, requiring careful individual review.
- Some datasets are translations, which may introduce nuances or inaccuracies.
- The repository is a curated list, not a unified, ready-to-use dataset; users must manage individual downloads and preprocessing.