embedding by ratsgo

Embedding tutorials for Korean NLP

Created 6 years ago

466 stars

Top 65.2% on SourcePulse

Project Summary

This repository provides tutorials and code for various embedding techniques, with a focus on Korean natural language processing. It's designed for researchers and practitioners looking to understand and implement word and sentence embeddings, from traditional methods like Word2Vec to modern transformer-based models like BERT. The project aims to demystify the process of creating and fine-tuning embeddings for Korean text.

How It Works

The project covers word-level methods (Latent Semantic Analysis, Word2Vec, GloVe, FastText, Swivel) and sentence-level methods (weighted embeddings, LSA, LDA, Doc2Vec, ELMo, BERT). It emphasizes corpus preprocessing with tools such as KoNLPy, Khaiii, soynlp, and sentencepiece, and demonstrates fine-tuning on sentiment classification with the Naver Sentiment Movie Corpus (NSMC). The code is structured so that different embedding models and fine-tuning strategies can be swapped in for experimentation.
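To make the idea of a weighted sentence embedding concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names, the toy vectors, and the weighting formula a / (a + p(w)) are assumptions chosen for illustration, and the tokens are assumed to be already segmented (a real pipeline would use one of the tokenizers listed above).

```python
from collections import Counter
from typing import Dict, List


def unigram_probs(corpus: List[List[str]]) -> Dict[str, float]:
    """Estimate unigram probabilities from a tokenized corpus."""
    counts = Counter(tok for sentence in corpus for tok in sentence)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def weighted_sentence_embedding(
    tokens: List[str],
    word_vectors: Dict[str, List[float]],
    word_probs: Dict[str, float],
    dim: int,
    a: float = 1e-3,
) -> List[float]:
    """Frequency-weighted average of word vectors.

    Each known token gets weight a / (a + p(token)), so frequent words
    contribute less. Tokens without a vector are skipped. This weighting
    is an illustrative assumption, not the project's exact method.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue
        w = a / (a + word_probs.get(tok, 0.0))
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


# Toy data (hypothetical): pre-tokenized sentences and 2-dimensional vectors.
corpus = [["영화", "정말", "재미"], ["영화", "정말"], ["영화"]]
vectors = {"영화": [1.0, 0.0], "정말": [0.5, 0.5], "재미": [0.0, 1.0]}
probs = unigram_probs(corpus)
sentence_vec = weighted_sentence_embedding(["영화", "재미"], vectors, probs, dim=2)
```

In this toy corpus the rarer token "재미" receives a larger weight than "영화", so the sentence vector leans toward its direction.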

Quick Start & Requirements

Install/Run: Docker is the recommended environment, since it reproduces the expected configuration. Refer to the project's environment documentation for details: http://ratsgo.github.io/embedding/environment.html
Prerequisites: TensorFlow 1.12.0 is the base requirement, and the project depends on specific package versions. CPU and GPU environments use different configurations.
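Because the pinned TensorFlow version matters, a quick sanity check inside the environment can catch a mismatch early. The following is a hedged sketch, not a script shipped with the project: the helper names are hypothetical, and it relies only on the standard `tensorflow.__version__` attribute.

```python
REQUIRED_TF = "1.12.0"  # base requirement stated in the prerequisites


def installed_tensorflow_version():
    """Return the installed TensorFlow version string, or None if absent."""
    try:
        import tensorflow as tf  # CPU and GPU builds share this module name
    except ImportError:
        return None
    return tf.__version__


def check_version(found, required):
    """Compare a found version string against the required one.

    Returns (ok, message). An exact match is required here because the
    project pins specific versions.
    """
    if found is None:
        return False, "TensorFlow is not installed (expected {})".format(required)
    if found != required:
        return False, "TensorFlow {} found, expected {}".format(found, required)
    return True, "TensorFlow {} OK".format(required)


if __name__ == "__main__":
    ok, message = check_version(installed_tensorflow_version(), REQUIRED_TF)
    print(message)
```

The sketch avoids f-strings and newer standard-library modules so that it can run on the older Python versions that TensorFlow 1.12.0 supports.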

Highlighted Details

Covers both word-level (LSA, Word2Vec, GloVe, FastText, Swivel) and sentence-level (ELMo, BERT) embedding techniques.
Includes practical fine-tuning examples using the Naver Sentiment Movie Corpus (NSMC) for sentiment classification.
Provides core code for various embedding models (BERT, ELMo, Swivel, XLNet) and utility scripts for preprocessing, training, evaluation, and visualization.
Offers Dockerfiles for both CPU and GPU environments to ensure reproducible setups.
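As an illustration of the data side of the NSMC fine-tuning examples, the sketch below parses NSMC-style records into (text, label) pairs. It assumes the tab-separated layout with an `id`, `document`, `label` header used by the public NSMC files; the function name and the inline sample rows are hypothetical and not taken from the repository.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_nsmc(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse NSMC-style TSV lines into (document, label) pairs.

    Rows with an empty document or a label other than 0/1 are skipped.
    QUOTE_NONE keeps quotation marks inside reviews as literal text.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for row in reader:
        text = (row.get("document") or "").strip()
        label = row.get("label")
        if not text or label not in ("0", "1"):
            continue
        examples.append((text, int(label)))
    return examples


# Hypothetical sample in the assumed layout; the second row has no text.
SAMPLE = (
    "id\tdocument\tlabel\n"
    "1\t정말 재미있는 영화\t1\n"
    "2\t\t0\n"
    "3\t시간이 아까웠다\t0\n"
)
examples = read_nsmc(io.StringIO(SAMPLE))
```

With a real file, the same function can be called on an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to a downloaded NSMC split.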

Maintenance & Community

The project accompanies the author's book on embeddings, so the tutorials follow a structured progression. The README does not describe community channels or ongoing maintenance.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on specific, older versions of dependencies, notably TensorFlow 1.12.0, which may pose compatibility challenges with current deep learning ecosystems. The focus is primarily on Korean corpora, and performance on other languages is not discussed.

Health Check

Last Commit

4 years ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days