KoAlpaca is an open-source language model project focused on understanding and responding to Korean instructions. It offers various models based on LLaMA and Polyglot-ko backbones, catering to researchers and developers working with Korean NLP tasks. The project provides fine-tuned models and datasets, enabling users to build Korean-specific conversational AI and instruction-following systems.
How It Works
KoAlpaca models are fine-tuned using the Stanford Alpaca methodology, adapting instruction-following techniques to Korean. The project utilizes both full fine-tuning and LoRA methods, leveraging large Korean datasets derived from sources like Naver Knowledge iN and translated Alpaca data. This approach aims to improve Korean language understanding and response generation quality compared to models trained solely on English data.
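As a concrete illustration of the Alpaca-style data format, each record pairs an instruction (optionally with input context) with a target output, which is then rendered into a single prompt string for training. The sketch below shows one way this could look in Korean; the field labels and template wording are assumptions for illustration and may differ from the exact template KoAlpaca uses.

```python
# Sketch of turning an Alpaca-style record into a Korean instruction prompt.
# The Korean template labels ("질문"/"답변", i.e. question/answer) are assumed
# for illustration and may not match the project's actual template.
def build_prompt(example: dict) -> str:
    """Render one instruction record as a single training prompt string."""
    if example.get("input"):
        return (
            f"### 질문: {example['instruction']}\n\n"
            f"### 입력: {example['input']}\n\n"
            f"### 답변: {example['output']}"
        )
    return f"### 질문: {example['instruction']}\n\n### 답변: {example['output']}"


record = {
    # "Summarize deep learning in one sentence." / Korean answer
    "instruction": "딥러닝을 한 문장으로 요약해 주세요.",
    "input": "",
    "output": "딥러닝은 다층 신경망으로 데이터의 표현을 학습하는 기계학습 방법입니다.",
}
print(build_prompt(record))
```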
Quick Start & Requirements
- Install core libraries:
pip install -U torch transformers tokenizers accelerate safetensors
- An inference example using the Hugging Face pipeline API is provided; a minimal sketch follows this list.
- Training examples for QLoRA and FSDP are available and require significant GPU resources (e.g., multiple A100s or RTX 3090/4090s); a QLoRA loading sketch also follows this list.
- Official Hugging Face models are available for direct use.
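The following is a minimal sketch of pipeline-based inference, not the project's official example. The model id, prompt template, and generation parameters are assumptions; verify the exact KoAlpaca checkpoint names on Hugging Face before running.

```python
# Minimal inference sketch using the Hugging Face pipeline API.
# Model id and prompt template are assumptions, not the project's official example.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="beomi/KoAlpaca-Polyglot-5.8B",  # assumed checkpoint id; verify on Hugging Face
    device_map="auto",                     # place weights on available GPUs
    torch_dtype="auto",
)

# Alpaca-style Korean prompt: "### Question: What is deep learning?\n\n### Answer:"
prompt = "### 질문: 딥러닝이 무엇인가요?\n\n### 답변:"
output = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```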
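For the QLoRA training path, the general pattern is to load the frozen base model in 4-bit with bitsandbytes and then attach LoRA adapters via PEFT. The sketch below illustrates that pattern under an assumed model id and illustrative hyperparameters; it is not the repository's actual training script.

```python
# QLoRA loading sketch: 4-bit quantized base model + LoRA adapters.
# Model id, target modules, and hyperparameters are assumptions, not the
# repository's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "EleutherAI/polyglot-ko-12.8b"  # example Polyglot-ko backbone

# Quantize the frozen base weights to 4-bit NF4 so large models fit in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapter matrices on top of the quantized model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],  # attention projection in GPT-NeoX-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```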
Highlighted Details
- Offers models based on Polyglot-ko (5.8B, 12.8B) and LLaMA (7B, 13B, 30B, 65B) backbones.
- Provides both full fine-tuned and LoRA-adapted weights.
- Includes a Korean instruction-following dataset (v1.1) generated from Naver Knowledge iN.
- Demonstrates improved performance on the NSMC benchmark compared to the base Polyglot-ko models.
Maintenance & Community
- Project activity and updates are logged in the README.
- Links to Hugging Face repositories for models are provided.
- No explicit community channels (Discord/Slack) are mentioned.
Licensing & Compatibility
- KoAlpaca models are typically released under licenses compatible with their base models (e.g., LLaMA's license).
- The dataset is available for research purposes.
- Commercial use depends on the underlying base model licenses.
Limitations & Caveats
- LLaMA-based models may show weaker Korean performance because their base models were pre-trained on relatively little Korean text.
- Training and inference for larger models require substantial GPU memory and compute.
- The web demo service has been discontinued as of May 2024.