Korean language models for NLP tasks
This repository provides a suite of pre-trained language models specifically for the Korean language, addressing the gap in high-performance NLP resources for non-English languages. It offers encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5, BERTSHARED) architectures, making them suitable for researchers and developers working on Korean NLP tasks.
How It Works
The models are trained on a diverse 70GB Korean text corpus, including cleaned web data, blogs, comments, and reviews, with the aim of handling informal language robustly. Tokenization is unified across all models using Hugging Face's WordPiece tokenizer with a 42,000-token vocabulary. The BERT models are trained with whole-word masking, while BERTSHARED shares parameters between the encoder and decoder for parameter-efficient sequence-to-sequence tasks.
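A minimal sketch of the shared tokenizer in action, assuming the kykim/bert-kor-base checkpoint shown in the Quick Start below carries the shared vocabulary:

from transformers import AutoTokenizer

# Load the shared WordPiece tokenizer (42,000-token vocabulary).
tokenizer = AutoTokenizer.from_pretrained("kykim/bert-kor-base")

# Informal Korean text is split into subword pieces from the shared vocabulary.
print(tokenizer.tokenize("이 영화 진짜 재밌었어요!"))
print(tokenizer.vocab_size)  # expected to be on the order of 42,000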
Quick Start & Requirements
The models integrate with the Hugging Face Transformers library and can be loaded in both PyTorch and TensorFlow.
from transformers import AutoTokenizer, AutoModel

# Download the tokenizer and encoder weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("kykim/bert-kor-base")
model = AutoModel.from_pretrained("kykim/bert-kor-base")
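A short usage sketch continuing from the snippet above; the tensor shapes and the availability of native TensorFlow weights are assumptions, not stated in the source:

import torch

# Encode a Korean sentence and run a forward pass (PyTorch).
inputs = tokenizer("한국어 문장을 인코딩합니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)

# TensorFlow loading; from_pt=True converts the PyTorch weights in case no
# native TF checkpoint is published for this model (an assumption).
# from transformers import TFAutoModel
# tf_model = TFAutoModel.from_pretrained("kykim/bert-kor-base", from_pt=True)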
No specific hardware requirements are stated beyond a standard Python environment, though fine-tuning or training these models typically benefits from a GPU.
Maintenance & Community
The project was last updated in January 2021 with the addition of the BERTSHARED and GPT3 models. The README does not describe community channels or a roadmap.
Licensing & Compatibility
The pretrained models are distributed under the Apache-2.0 License. Commercial use is permitted via an MOU; inquiries can be directed to kykim@artificial.sc.
Limitations & Caveats
The models are pre-trained only and will generally need fine-tuning on a specific downstream task for best performance. The project's last update was in early 2021, so newer Korean architectures and training methodologies have likely emerged since.
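As a hedged sketch of that fine-tuning step, one common approach is to attach a task head to the pre-trained encoder; the num_labels value and the binary task are illustrative placeholders, not part of the source:

from transformers import AutoModelForSequenceClassification

# Attach a randomly initialized classification head to the pre-trained encoder.
# num_labels=2 is a placeholder for an illustrative binary task.
clf = AutoModelForSequenceClassification.from_pretrained(
    "kykim/bert-kor-base", num_labels=2
)
# clf would then be fine-tuned on labeled Korean data, e.g. with the
# Transformers Trainer API or a custom PyTorch training loop.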