Introducing Nagisa: A Japanese Word Segmentation and POS-Tagging Tool
Nagisa is a Python module for Japanese word segmentation and part-of-speech (POS) tagging. It is aimed at developers and linguists who need a simple, efficient way to process Japanese text.
Key Features of Nagisa
Nagisa is based on recurrent neural networks. Its two components work as follows:
- Word Segmentation: The segmentation model uses both character-level and word-level features to locate word boundaries [池田+].
- POS-Tagging: The tagging model uses tag dictionary information to assign a part of speech to each word [Inoue+].
Installation Requirements
Supported Python versions depend on the platform. On Linux, Nagisa works with Python 3.6 through 3.12. On macOS (Intel or M1/M2), it supports Python 3.9 through 3.12. On Windows, it requires 64-bit Python 3.6, 3.7, or 3.8; it can also be used through the Windows Subsystem for Linux (WSL). Install it with pip:
pip install nagisa
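Before installing, it can help to check whether the current interpreter falls inside the ranges listed above. The following is a minimal sketch using only the standard library; the `SUPPORTED` table and the `is_supported` helper are hypothetical names introduced here, and the ranges are simply copied from this section rather than read from the package metadata.

```python
import platform
import sys

# Supported Python 3 minor versions (min, max) per platform,
# copied from the text above. This table is an illustration only.
SUPPORTED = {
    "Linux": (6, 12),
    "Darwin": (9, 12),   # macOS, Intel or M1/M2
    "Windows": (6, 8),   # 64-bit only
}


def is_supported(system, version_info):
    """Return True if (major, minor) lies in the range listed for `system`."""
    major, minor = version_info[0], version_info[1]
    if major != 3 or system not in SUPPORTED:
        return False
    low, high = SUPPORTED[system]
    return low <= minor <= high


# Check the interpreter that is running this script.
print(is_supported(platform.system(), sys.version_info))
```

The check does not verify that a Windows interpreter is 64-bit, and WSL reports itself as Linux, so it follows the Linux range.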
Basic Usage
The nagisa.tagging function segments a string and tags each word in a single call. The returned object exposes the words and their POS tags as two parallel lists:
import nagisa
text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞
# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
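Because the words and tags come back as parallel lists, they combine naturally with ordinary Python tools. The sketch below pairs each word with its tag and counts tag frequencies. To stay runnable without Nagisa installed, it hard-codes the two lists from the output above; in practice they would come from `words.words` and `words.postags`.

```python
from collections import Counter

# Lists copied from the example output above (stand-ins for
# `words.words` and `words.postags`).
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

# Pair each word with its POS tag.
pairs = list(zip(words, postags))
print(pairs[0])            # ('Python', '名詞')

# Count how often each POS tag occurs in the sentence.
tag_counts = Counter(postags)
print(tag_counts['名詞'])  # 2
```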
Post-Processing Functions
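Two helper functions select words by POS tag: nagisa.filter removes the words that carry the listed tags, while nagisa.extract keeps only the words that carry them. The attribute nagisa.tagger.postags lists the tags that the default model can produce:
# Remove words with the given POS tags.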
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞
# Extract only nouns.
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞
# List available POS-tags
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']
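The difference between filtering and extracting can be illustrated on plain lists. This is a conceptual sketch only, not Nagisa's implementation: `drop_postags` and `keep_postags` are hypothetical helpers, and the input lists are copied from the basic usage example.

```python
def drop_postags(words, postags, unwanted):
    """Return (word, tag) pairs whose tag is NOT in `unwanted`."""
    unwanted = set(unwanted)
    return [(w, t) for w, t in zip(words, postags) if t not in unwanted]


def keep_postags(words, postags, wanted):
    """Return (word, tag) pairs whose tag IS in `wanted`."""
    wanted = set(wanted)
    return [(w, t) for w, t in zip(words, postags) if t in wanted]


words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

# Mirrors the filter example: particles and auxiliary verbs are removed.
print(drop_postags(words, postags, ['助詞', '助動詞']))
# [('Python', '名詞'), ('簡単', '形状詞'), ('使える', '動詞'), ('ツール', '名詞')]

# Mirrors the extract example: only nouns remain.
print(keep_postags(words, postags, ['名詞']))
# [('Python', '名詞'), ('ツール', '名詞')]
```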
User Dictionary Customization
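Words that the default model splits incorrectly, such as titles or product names, can be registered through the single_word_list argument of nagisa.Tagger so that they are kept as one token. In the example below, the title 3月のライオン is first split into several pieces and then recognized as a single noun once it has been registered: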
text = "3月に見た「3月のライオン」"
print(nagisa.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号
new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号
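When the list of custom entries grows, keeping it in a text file may be more convenient than hard-coding it. The sketch below assumes a hypothetical file layout with one entry per line, blank lines ignored, and lines starting with `#` treated as comments; `load_word_list` and `user_words.txt` are illustrative names, not part of Nagisa.

```python
import tempfile
from pathlib import Path


def load_word_list(path):
    """Read one dictionary entry per line, skipping blanks and '#' comments."""
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


# Demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    word_file = Path(tmp) / "user_words.txt"
    word_file.write_text("# titles\n3月のライオン\n\n", encoding="utf-8")
    custom_words = load_word_list(word_file)

print(custom_words)  # ['3月のライオン']
```

The resulting list could then be passed as the `single_word_list` argument shown above.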
Training a Custom Model
Nagisa also provides a simple way to train a custom sequence-labeling model, for example for POS tagging or Named Entity Recognition (NER). The training, development, and test files must be in TSV format: each line holds one word and its tag separated by a tab, and a line containing only EOS closes each sentence:
$ cat sample.train
唯一	NOUN
の	ADP
趣味	NOUN
は	ADP
料理	NOUN
EOS
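A file in this layout can be produced and checked with a few lines of standard-library Python. The following is a sketch under the assumption that columns are tab-separated and every sentence ends with an `EOS` line; `write_tagged_sentences` and `read_tagged_sentences` are hypothetical helpers, not Nagisa functions.

```python
import os
import tempfile


def write_tagged_sentences(path, sentences):
    """Write sentences (lists of (word, tag) pairs) as 'word<TAB>tag' lines."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for word, tag in sentence:
                f.write(f"{word}\t{tag}\n")
            f.write("EOS\n")  # sentence terminator


def read_tagged_sentences(path):
    """Parse the file back into sentences, rejecting malformed lines."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if line == "EOS":
                sentences.append(current)
                current = []
                continue
            if not line:
                continue  # tolerate blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                raise ValueError(f"line {line_no}: expected 'word<TAB>tag', got {line!r}")
            current.append((parts[0], parts[1]))
    if current:
        raise ValueError("last sentence is not terminated by EOS")
    return sentences


sample = [[("唯一", "NOUN"), ("の", "ADP"), ("趣味", "NOUN"),
           ("は", "ADP"), ("料理", "NOUN")]]

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "sample.train")
    write_tagged_sentences(train_path, sample)
    round_trip = read_tagged_sentences(train_path)

print(round_trip == sample)  # True
```

Running the reader over a hand-edited file before training is a cheap way to catch lines where a space was typed instead of a tab.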
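Train the model by calling nagisa.fit from Python, passing the training, development, and test files along with a name for the resulting model:
import nagisa
nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")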
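The call above expects three separate files. If only a single annotated corpus is available, one possible way to derive train, dev, and test sets is a seeded random split at the sentence level. The sketch below is an assumption about workflow, not something Nagisa prescribes: `split_sentences` is a hypothetical helper, and the 10% dev and test ratios are arbitrary defaults.

```python
import random


def split_sentences(sentences, dev_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle sentences reproducibly and split them into train/dev/test."""
    shuffled = list(sentences)              # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Stand-in corpus: 20 one-word sentences of (word, tag) pairs.
corpus = [[(f"w{i}", "NOUN")] for i in range(20)]
train, dev, test = split_sentences(corpus)
print(len(train), len(dev), len(test))  # 16 2 2
```

Splitting whole sentences rather than individual lines keeps each EOS-terminated sentence intact in exactly one of the three files.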
Conclusion
Nagisa offers a compact solution for Japanese word segmentation and POS tagging: it installs with a single pip command, exposes a small API for tagging, filtering, and extraction, and can be extended with a user dictionary or retrained on custom data. For further details and updates, see the official documentation.