KoBERT by SKTBrain

Korean BERT for language tasks

created 5 years ago
1,371 stars

Top 30.0% on sourcepulse

Project Summary

KoBERT is a Korean BERT model pre-trained on a large Korean corpus, offering improved performance over multilingual BERT for Korean NLP tasks. It is designed for researchers and developers working with Korean language processing, providing a strong foundation for fine-tuning on specific downstream tasks like sentiment analysis and named entity recognition.

How It Works

KoBERT follows the BERT base architecture: 12 layers, 768 hidden units, and 12 attention heads. It uses a SentencePiece tokenizer trained on Korean Wikipedia, yielding a vocabulary of 8,002 tokens; this compact vocabulary is what gives KoBERT fewer parameters than multilingual BERT base while improving Korean language understanding. The model is pre-trained on roughly 5 million sentences (about 54 million words) of Korean Wikipedia text.
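As an illustration of the tokenizer, here is a minimal sketch; it assumes the kobert_tokenizer package shipped in the repo's kobert_hf subdirectory and the skt/kobert-base-v1 checkpoint on the Hugging Face Hub:

```python
# Minimal sketch (assumes kobert_tokenizer from the repo's kobert_hf
# subdirectory and the skt/kobert-base-v1 Hugging Face checkpoint).
from kobert_tokenizer import KoBERTTokenizer

tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
print(tokenizer.vocab_size)                         # 8002 tokens per the README
print(tokenizer.tokenize("한국어 모델을 공유합니다."))  # SentencePiece subwords
```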

Quick Start & Requirements

  • Install: pip install git+https://git@github.com/SKTBrain/KoBERT.git@master (see the loading sketch after this list)
  • Prerequisites: Python and PyTorch; ONNX and MXNet-Gluon are needed only for those model formats. A GPU is recommended for fine-tuning.
  • Resources: pre-trained model weights are downloaded on first use.
  • Docs: Hugging Face transformers API, Colab example
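A minimal loading sketch, based on the PyTorch helper shown in the project's README (weights are fetched on the first call):

```python
# Load the pre-trained PyTorch model and vocabulary; weights are
# downloaded and cached on the first call.
import torch
from kobert import get_pytorch_kobert_model

model, vocab = get_pytorch_kobert_model()

# Toy batch of two sequences: token ids, attention mask, segment ids.
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
print(sequence_output.shape)  # torch.Size([2, 3, 768]) -- matches the 768 hidden units
```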

Highlighted Details

  • Achieves 90.1% accuracy on the Naver Sentiment Analysis dataset, outperforming BERT base multilingual cased (87.5%); a minimal fine-tuning sketch follows this list.
  • Provides pre-trained models compatible with PyTorch, ONNX, and MXNet-Gluon.
  • Includes a SentencePiece tokenizer specifically trained for Korean.
  • Demonstrates successful application in Named Entity Recognition (NER) tasks using BERT-CRF.
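The sentiment figure above comes from fine-tuning a classification head on KoBERT's pooled output. The wrapper below is a hypothetical sketch, not the repo's training code; SentimentClassifier and its parameters are illustrative:

```python
import torch.nn as nn

# Hypothetical wrapper (illustrative, not from the repo): a linear head
# on KoBERT's pooled [CLS] output, as typically used for NSMC-style
# binary sentiment classification.
class SentimentClassifier(nn.Module):
    def __init__(self, bert, hidden_size=768, num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = bert                      # e.g. from get_pytorch_kobert_model()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        _, pooled = self.bert(input_ids, attention_mask, token_type_ids)
        return self.classifier(self.dropout(pooled))  # logits over the classes
```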

Maintenance & Community

  • Versioned releases with release history are available on GitHub.
  • Issues can be reported via GitHub Issues.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Permissive for commercial use and integration with closed-source projects, provided license terms are followed.

Limitations & Caveats

The README notes that the model loaders return the model in eval() mode by default, so it must be switched to train() mode explicitly before fine-tuning.
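The fix is one line before the training loop (a sketch reusing the loader from the Quick Start example above):

```python
model, vocab = get_pytorch_kobert_model()  # model arrives in eval() mode
model.train()  # switch explicitly, otherwise dropout stays disabled during fine-tuning
```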

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 90 days
