LMkor by kiyoungkim1

Korean language models for NLP tasks

created 4 years ago
397 stars

Top 73.8% on sourcepulse

Project Summary

This repository provides a suite of pre-trained language models specifically for the Korean language, addressing the gap in high-performance NLP resources for non-English languages. It offers encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5, BERTSHARED) architectures, making them suitable for researchers and developers working on Korean NLP tasks.

How It Works

The models are trained on a diverse 70GB Korean text corpus, including cleaned web data, blogs, comments, and reviews, with the aim of handling informal language robustly. Tokenization is unified across all models using Huggingface's WordPiece tokenizer with a 42,000-token vocabulary. The BERT variants use whole-word masking, while BERTSHARED shares parameters between the encoder and decoder for efficient sequence-to-sequence tasks.
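For illustration, here is a minimal sketch of inspecting the shared tokenizer, assuming it is exposed through Huggingface's AutoTokenizer via the kykim/bert-kor-base checkpoint shown in the Quick Start section below; the sample sentence is purely illustrative:

from transformers import AutoTokenizer

# All LMkor models reportedly share one WordPiece vocabulary of roughly 42,000 tokens
tokenizer = AutoTokenizer.from_pretrained("kykim/bert-kor-base")
print(tokenizer.vocab_size)  # expected to be around 42,000

# Informal Korean, similar in register to the blog/comment/review data in the corpus
tokens = tokenizer.tokenize("이 영화 진짜 재밌었어요!")  # "This movie was really fun!"
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))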

Quick Start & Requirements

The models can be loaded with the Huggingface Transformers library in both PyTorch and TensorFlow.

from transformers import AutoTokenizer, AutoModel

# Load the Korean BERT tokenizer and encoder weights from the Huggingface Hub
tokenizer = AutoTokenizer.from_pretrained("kykim/bert-kor-base")
model = AutoModel.from_pretrained("kykim/bert-kor-base")
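The same checkpoint can also be loaded in TensorFlow; this is a sketch assuming TensorFlow weights are available on the Hub (from_pt=True converts the PyTorch weights on the fly if they are not):

from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("kykim/bert-kor-base")
# from_pt=True falls back to converting the PyTorch checkpoint if no TF weights exist
model = TFAutoModel.from_pretrained("kykim/bert-kor-base", from_pt=True)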

The README does not state hardware requirements beyond a standard Python environment, though fine-tuning or further pre-training the larger models typically benefits from a GPU.

Highlighted Details

  • Offers a variety of Korean NLP models including BERT, GPT, T5, and Funnel-Transformer variants.
  • Models are trained on a substantial 70GB Korean corpus, including informal text data.
  • Provides competitive benchmark results across several Korean NLP tasks like sentiment analysis, NER, and NLI.
  • BERTSHARED model enables efficient seq2seq tasks by sharing parameters between the encoder and decoder (see the sketch after this list).
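To make the parameter-sharing point concrete, here is a hedged sketch that loads BERTSHARED through Transformers' EncoderDecoderModel class; the checkpoint name kykim/bertshared-kor-base and the decoder settings are assumptions, not values confirmed by the README:

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("kykim/bertshared-kor-base")  # assumed checkpoint name
model = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")

# generate() needs these set if the checkpoint config does not already define them
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# "A long Korean sentence that we would like to summarize."
inputs = tokenizer("요약하고 싶은 긴 한국어 문장입니다.", return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], max_length=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))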

Maintenance & Community

The project was last updated in January 2021 with the addition of BERTSHARED and GPT3 models. Further community engagement or roadmap details are not explicitly provided in the README.

Licensing & Compatibility

The pretrained models are distributed under the Apache-2.0 License. Commercial use is permitted via an MOU; inquiries can be directed to kykim@artificial.sc.

Limitations & Caveats

The models are pre-trained only and typically require fine-tuning on specific downstream tasks for optimal performance (a sketch follows below). The last substantive update was in early 2021, so newer Korean models and training techniques released since then are not reflected in this project.
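As a hedged illustration of that fine-tuning step, the sketch below attaches a binary classification head to the base encoder and trains it with the Transformers Trainer API; the toy texts, labels, output directory, and hyperparameters are placeholders rather than the project's benchmark setup:

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("kykim/bert-kor-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "kykim/bert-kor-base", num_labels=2)  # e.g. positive/negative sentiment

# Placeholder data; substitute a real Korean benchmark such as NSMC for meaningful results
texts = ["영화 정말 재밌었어요", "시간이 아까웠다"]  # "really enjoyed it" / "waste of time"
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

training_args = TrainingArguments(output_dir="bert-kor-finetuned",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=2)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=ToyDataset(encodings, labels))
trainer.train()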

Health Check

Last commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

6 stars in the last 90 days
