Introducing KR-BERT: A Specialized Korean Language Model
KR-BERT is a Korean-specific, small-scale BERT model developed by the Computational Linguistics Lab at Seoul National University. The project builds a language model that addresses the distinctive characteristics of the Korean language. Presented in the paper KR-BERT: A Small-Scale Korean-Specific Language Model, it delivers performance comparable to or better than other models, especially on Korean-language tasks.
Vocabulary, Parameters, and Data
KR-BERT stands out due to its tailored vocabulary and parameter settings crafted specifically for Korean text. Here's a quick comparison with other notable models:
| | Multilingual BERT (Google) | KorBERT (ETRI) | KoBERT (SKT) | KR-BERT character | KR-BERT sub-character |
|---|---|---|---|---|---|
| Vocab Size | 119,547 | 30,797 | 8,002 | 16,424 | 12,367 |
| Parameter Size | 167,356,416 | 109,973,391 | 92,186,880 | 99,265,066 | 96,145,233 |
| Data Size | - (104 languages) | 23GB | - (233M words) | 2.47GB | 2.47GB |
On masked LM accuracy, both KR-BERT variants outperform KoBERT:
- KoBERT: 0.750
- KR-BERT Character Model: 0.779
- KR-BERT Sub-character Model: 0.769
Sub-character Representation
Hangul, the script used for Korean, composes each syllable block from smaller components (jamo), so a character can be broken into sub-characters. KR-BERT exploits this by providing models that operate on either full syllable characters or sub-characters. To prepare text for the sub-character model, each string is decomposed into jamo via Unicode NFKD normalization before tokenization, as in the snippet below.
```python
from transformers import BertTokenizer
from unicodedata import normalize

# Load the sub-character vocabulary (placeholder path: point it at the downloaded vocab file)
tokenizer_krbert = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)

# Convert a string into sub-character (jamo) format via NFKD decomposition
def to_subchar(string):
    return normalize('NFKD', string)

sentence = '토크나이저 예시입니다.'
print(tokenizer_krbert.tokenize(to_subchar(sentence)))
```
Tokenization Technique
KR-BERT employs a BidirectionalWordPiece tokenizer. Unlike standard WordPiece, which matches subword candidates in a single direction, it builds candidate segmentations in both the forward and backward directions and keeps the candidate whose tokens occur more frequently, reducing search cost while retaining a choice between segmentations. The table below shows how the word 냉장고 ("refrigerator") is tokenized by different models, and a toy sketch of the selection rule follows it.
| | Multilingual BERT | KorBERT | KoBERT | KR-BERT Character | KR-BERT Sub-character |
|---|---|---|---|---|---|
| Example: 냉장고 ("refrigerator") | 냉#장#고 | 냉#장#고 | 냉#장#고 | 냉장고 | 냉장고 |
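To make the selection rule concrete, here is a minimal toy sketch, not the project's actual implementation: the frequency table and the helper names (`greedy_segment`, `bidirectional_segment`) are invented for illustration. It segments a word greedily from the left and from the right against a frequency-annotated vocabulary and keeps the segmentation whose pieces are more frequent.

```python
def greedy_segment(word, vocab, reverse=False):
    """Greedy longest-match segmentation; scans right-to-left when reverse=True."""
    tokens = []
    s = word[::-1] if reverse else word
    i = 0
    while i < len(s):
        for j in range(len(s), i, -1):           # try the longest remaining piece first
            piece = s[i:j]
            cand = piece[::-1] if reverse else piece
            if cand in vocab:
                tokens.append(cand)
                i = j
                break
        else:                                     # no vocabulary match: emit a single character
            tokens.append(s[i][::-1] if reverse else s[i])
            i += 1
    return tokens[::-1] if reverse else tokens


def bidirectional_segment(word, vocab):
    """Keep whichever of the forward/backward segmentations uses more frequent pieces."""
    fwd = greedy_segment(word, vocab)
    bwd = greedy_segment(word, vocab, reverse=True)
    score = lambda toks: sum(vocab.get(t, 0) for t in toks)
    return fwd if score(fwd) >= score(bwd) else bwd


# Invented frequency table for the example word 냉장고 ("refrigerator")
vocab = {'냉장고': 120, '냉장': 40, '고': 300, '냉': 80, '장고': 10}
print(bidirectional_segment('냉장고', vocab))    # ['냉장고']
```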
Models Available
KR-BERT offers pretrained models for both TensorFlow and PyTorch, combining two tokenizers (WordPiece and BidirectionalWordPiece) with two representations (character and sub-character):
| | Character | Sub-character |
|---|---|---|
| WordPiece | Available | Available |
| BidirectionalWordPiece | Available | Available |
Implementation Requirements
To utilize KR-BERT models, ensure you have the following requirements installed:
```
transformers == 2.1.1
tensorflow < 2.0
```
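As a quick sanity check that the environment works, the sketch below loads a downloaded checkpoint with the transformers 2.x API. The paths are placeholders for wherever you saved the KR-BERT checkpoint and vocabulary file, and torch is assumed to be installed for the PyTorch variant.

```python
import torch
from transformers import BertModel, BertTokenizer

# Placeholder paths: point these at the downloaded KR-BERT checkpoint
# directory and its vocabulary file.
tokenizer = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)
model = BertModel.from_pretrained('/path/to/krbert_pytorch_checkpoint')
model.eval()

# Encode a sentence and run a forward pass to obtain contextual embeddings
input_ids = torch.tensor([tokenizer.encode('예시 문장입니다.', add_special_tokens=True)])
with torch.no_grad():
    last_hidden_state = model(input_ids)[0]   # (1, sequence_length, hidden_size)
print(last_hidden_state.shape)
```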
Application: Naver Sentiment Movie Corpus (NSMC)
KR-BERT models are evaluated on downstream tasks such as sentiment classification on the Naver Sentiment Movie Corpus (NSMC), where they achieve high accuracy:
| Model | Accuracy (PyTorch) | Accuracy (TensorFlow) |
|---|---|---|
| KR-BERT Character Bidirectional | 89.38 | 90.10 |
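For orientation, a single fine-tuning step on an NSMC-style example might look like the sketch below. This is only an outline under the transformers 2.1.1 API: the checkpoint and vocabulary paths are placeholders, and the review/label pair is invented for illustration.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Placeholder paths for the downloaded KR-BERT checkpoint and vocabulary
tokenizer = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)
model = BertForSequenceClassification.from_pretrained('/path/to/krbert_pytorch_checkpoint',
                                                      num_labels=2)   # negative / positive

# A hypothetical NSMC-style review and its label (1 = positive)
input_ids = torch.tensor([tokenizer.encode('정말 재미있게 봤어요.', add_special_tokens=True)])
labels = torch.tensor([1])

model.train()
loss, logits = model(input_ids, labels=labels)[:2]
loss.backward()   # pair with an optimizer step inside a full training loop
```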
KR-BERT handles Korean-specific tasks well while letting users choose the tokenizer and character representation that best fit their needs.
Citation
For those utilizing KR-BERT in academic or commercial projects, referencing the original work and its authors ensures proper credit:
```
@article{lee2020krbert,
    title={KR-BERT: A Small-Scale Korean-Specific Language Model},
    author={Sangah Lee and Hansol Jang and Yunmee Baik and Suzi Park and Hyopil Shin},
    journal={ArXiv},
    volume={abs/2008.03979},
    year={2020}
}
```
For further inquiries or assistance regarding KR-BERT, the project team can be contacted via [email protected].