KoBERT by SKTBrain

Korean BERT for language tasks

created 5 years ago
1,371 stars

Top 30.0% on sourcepulse

Project Summary

KoBERT is a Korean BERT model pre-trained on a large Korean corpus, offering improved performance over multilingual BERT for Korean NLP tasks. It is designed for researchers and developers working with Korean language processing, providing a strong foundation for fine-tuning on specific downstream tasks like sentiment analysis and named entity recognition.

How It Works

KoBERT follows the BERT base architecture: 12 layers, 768 hidden units, and 12 attention heads. It uses a SentencePiece tokenizer trained on Korean Wikipedia, yielding a vocabulary of 8,002 tokens; this compact vocabulary is what gives KoBERT fewer parameters than multilingual BERT base while improving Korean language understanding. The model is pre-trained on roughly 5 million sentences (about 54 million words) of Korean Wikipedia text.
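As an illustration of the tokenizer, here is a minimal sketch; it assumes the kobert_tokenizer package shipped in the repo's kobert_hf subdirectory and the skt/kobert-base-v1 checkpoint on the Hugging Face Hub:

```python
# Minimal sketch (assumes kobert_tokenizer from the repo's kobert_hf
# subdirectory and the skt/kobert-base-v1 Hugging Face checkpoint).
from kobert_tokenizer import KoBERTTokenizer

tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
print(tokenizer.vocab_size)                         # 8002 tokens per the README
print(tokenizer.tokenize("한국어 모델을 공유합니다."))  # SentencePiece subwords
```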

Quick Start & Requirements

  • Install: pip install git+https://git@github.com/SKTBrain/KoBERT.git@master (see the loading sketch after this list)
  • Prerequisites: Python and PyTorch; ONNX and MXNet-Gluon are needed only for those model formats. A GPU is recommended for fine-tuning.
  • Resources: pre-trained model weights are downloaded on first use.
  • Docs: Hugging Face transformers API, Colab example
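A minimal loading sketch, based on the PyTorch helper shown in the project's README (weights are fetched on the first call):

```python
# Load the pre-trained PyTorch model and vocabulary; weights are
# downloaded and cached on the first call.
import torch
from kobert import get_pytorch_kobert_model

model, vocab = get_pytorch_kobert_model()

# Toy batch of two sequences: token ids, attention mask, segment ids.
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
print(sequence_output.shape)  # torch.Size([2, 3, 768]) -- matches the 768 hidden units
```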

Highlighted Details

  • Achieves 90.1% accuracy on the Naver Sentiment Analysis dataset, outperforming BERT base multilingual cased (87.5%); a minimal fine-tuning sketch follows this list.
  • Provides pre-trained models compatible with PyTorch, ONNX, and MXNet-Gluon.
  • Includes a SentencePiece tokenizer specifically trained for Korean.
  • Demonstrates successful application in Named Entity Recognition (NER) tasks using BERT-CRF.
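The sentiment figure above comes from fine-tuning a classification head on KoBERT's pooled output. The wrapper below is a hypothetical sketch, not the repo's training code; SentimentClassifier and its parameters are illustrative:

```python
import torch.nn as nn

# Hypothetical wrapper (illustrative, not from the repo): a linear head
# on KoBERT's pooled [CLS] output, as typically used for NSMC-style
# binary sentiment classification.
class SentimentClassifier(nn.Module):
    def __init__(self, bert, hidden_size=768, num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = bert                      # e.g. from get_pytorch_kobert_model()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        _, pooled = self.bert(input_ids, attention_mask, token_type_ids)
        return self.classifier(self.dropout(pooled))  # logits over the classes
```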

Maintenance & Community

  • Versioned releases with release history are available on GitHub.
  • Issues can be reported via GitHub Issues.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Permissive for commercial use and integration with closed-source projects, provided license terms are followed.

Limitations & Caveats

The README notes that the model loaders return the model in eval() mode by default, so it must be switched to train() mode explicitly before fine-tuning.
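The fix is one line before the training loop (a sketch reusing the loader from the Quick Start example above):

```python
model, vocab = get_pytorch_kobert_model()  # model arrives in eval() mode
model.train()  # switch explicitly, otherwise dropout stays disabled during fine-tuning
```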

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 90 days
