Introducing KR-BERT: A Specialized Korean Language Model
KR-BERT is a Korean-specific, small-scale BERT model developed by the Computational Linguistics Lab at Seoul National University. The project builds a language model that addresses the distinctive characteristics of the Korean language. Presented in the paper KR-BERT: A Small-Scale Korean-Specific Language Model, it delivers performance comparable to or better than other models, especially on Korean-language tasks.
Vocabulary, Parameters, and Data
KR-BERT stands out due to its tailored vocabulary and parameter settings crafted specifically for Korean text. Here's a quick comparison with other notable models:
| | Multilingual BERT (Google) | KorBERT (ETRI) | KoBERT (SKT) | KR-BERT character | KR-BERT sub-character |
|---|---|---|---|---|---|
| Vocab Size | 119,547 | 30,797 | 8,002 | 16,424 | 12,367 |
| Parameter Size | 167,356,416 | 109,973,391 | 92,186,880 | 99,265,066 | 96,145,233 |
| Data Size | - (104 languages) | 23GB | - (233M words) | 2.47GB | 2.47GB |
On masked LM accuracy, both KR-BERT variants outperform KoBERT:
- KoBERT: 0.750
- KR-BERT Character Model: 0.779
- KR-BERT Sub-character Model: 0.769
Sub-character Representation
Hangul, the script used for Korean, composes each syllable block from smaller components (jamo), so a character can be broken into sub-characters. KR-BERT exploits this by providing models that operate on either full syllable characters or sub-characters. To prepare text for the sub-character model, each string is decomposed into jamo via Unicode NFKD normalization before tokenization, as in the snippet below.
```python
from transformers import BertTokenizer
from unicodedata import normalize

# Load the sub-character vocabulary (placeholder path: point it at the downloaded vocab file)
tokenizer_krbert = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)

# Convert a string into sub-character (jamo) format via NFKD decomposition
def to_subchar(string):
    return normalize('NFKD', string)

sentence = '토크나이저 예시입니다.'
print(tokenizer_krbert.tokenize(to_subchar(sentence)))
```
Tokenization Technique
KR-BERT employs a BidirectionalWordPiece tokenizer. Unlike standard WordPiece, which matches subword candidates in a single direction, it builds candidate segmentations in both the forward and backward directions and keeps the candidate whose tokens occur more frequently, reducing search cost while retaining a choice between segmentations. The table below shows how the word 냉장고 ("refrigerator") is tokenized by different models, and a toy sketch of the selection rule follows it.
| | Multilingual BERT | KorBERT | KoBERT | KR-BERT Character | KR-BERT Sub-character |
|---|---|---|---|---|---|
| Example: 냉장고 ("refrigerator") | 냉#장#고 | 냉#장#고 | 냉#장#고 | 냉장고 | 냉장고 |
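To make the selection rule concrete, here is a minimal toy sketch, not the project's actual implementation: the frequency table and the helper names (`greedy_segment`, `bidirectional_segment`) are invented for illustration. It segments a word greedily from the left and from the right against a frequency-annotated vocabulary and keeps the segmentation whose pieces are more frequent.

```python
def greedy_segment(word, vocab, reverse=False):
    """Greedy longest-match segmentation; scans right-to-left when reverse=True."""
    tokens = []
    s = word[::-1] if reverse else word
    i = 0
    while i < len(s):
        for j in range(len(s), i, -1):           # try the longest remaining piece first
            piece = s[i:j]
            cand = piece[::-1] if reverse else piece
            if cand in vocab:
                tokens.append(cand)
                i = j
                break
        else:                                     # no vocabulary match: emit a single character
            tokens.append(s[i][::-1] if reverse else s[i])
            i += 1
    return tokens[::-1] if reverse else tokens


def bidirectional_segment(word, vocab):
    """Keep whichever of the forward/backward segmentations uses more frequent pieces."""
    fwd = greedy_segment(word, vocab)
    bwd = greedy_segment(word, vocab, reverse=True)
    score = lambda toks: sum(vocab.get(t, 0) for t in toks)
    return fwd if score(fwd) >= score(bwd) else bwd


# Invented frequency table for the example word 냉장고 ("refrigerator")
vocab = {'냉장고': 120, '냉장': 40, '고': 300, '냉': 80, '장고': 10}
print(bidirectional_segment('냉장고', vocab))    # ['냉장고']
```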
Models Available
KR-BERT offers pretrained models for both TensorFlow and PyTorch, combining two tokenizers (WordPiece and BidirectionalWordPiece) with two representations (character and sub-character):
| | Character | Sub-character |
|---|---|---|
| WordPiece | Available | Available |
| BidirectionalWordPiece | Available | Available |
Implementation Requirements
To utilize KR-BERT models, ensure you have the following requirements installed:
```
transformers == 2.1.1
tensorflow < 2.0
```
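As a quick sanity check that the environment works, the sketch below loads a downloaded checkpoint with the transformers 2.x API. The paths are placeholders for wherever you saved the KR-BERT checkpoint and vocabulary file, and torch is assumed to be installed for the PyTorch variant.

```python
import torch
from transformers import BertModel, BertTokenizer

# Placeholder paths: point these at the downloaded KR-BERT checkpoint
# directory and its vocabulary file.
tokenizer = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)
model = BertModel.from_pretrained('/path/to/krbert_pytorch_checkpoint')
model.eval()

# Encode a sentence and run a forward pass to obtain contextual embeddings
input_ids = torch.tensor([tokenizer.encode('예시 문장입니다.', add_special_tokens=True)])
with torch.no_grad():
    last_hidden_state = model(input_ids)[0]   # (1, sequence_length, hidden_size)
print(last_hidden_state.shape)
```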
Application: Naver Sentiment Movie Corpus (NSMC)
KR-BERT models are evaluated on downstream tasks such as sentiment classification on the Naver Sentiment Movie Corpus (NSMC), where they achieve high accuracy:
| Model | Accuracy (PyTorch) | Accuracy (TensorFlow) |
|---|---|---|
| KR-BERT Character Bidirectional | 89.38 | 90.10 |
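For orientation, a single fine-tuning step on an NSMC-style example might look like the sketch below. This is only an outline under the transformers 2.1.1 API: the checkpoint and vocabulary paths are placeholders, and the review/label pair is invented for illustration.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Placeholder paths for the downloaded KR-BERT checkpoint and vocabulary
tokenizer = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)
model = BertForSequenceClassification.from_pretrained('/path/to/krbert_pytorch_checkpoint',
                                                      num_labels=2)   # negative / positive

# A hypothetical NSMC-style review and its label (1 = positive)
input_ids = torch.tensor([tokenizer.encode('정말 재미있게 봤어요.', add_special_tokens=True)])
labels = torch.tensor([1])

model.train()
loss, logits = model(input_ids, labels=labels)[:2]
loss.backward()   # pair with an optimizer step inside a full training loop
```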
KR-BERT handles Korean-specific tasks well while letting users choose the tokenizer and character representation that best fit their needs.
Citation
For those utilizing KR-BERT in academic or commercial projects, referencing the original work and its authors ensures proper credit:
```
@article{lee2020krbert,
    title={KR-BERT: A Small-Scale Korean-Specific Language Model},
    author={Sangah Lee and Hansol Jang and Yunmee Baik and Suzi Park and Hyopil Shin},
    journal={ArXiv},
    volume={abs/2008.03979},
    year={2020}
}
```
For further inquiries or assistance regarding KR-BERT, the project team can be contacted via [email protected].