nagisa - Simplifying Japanese Word Segmentation and POS-Tagging with Neural Networks

Introducing Nagisa: A Japanese Word Segmentation and POS-Tagging Tool

Nagisa is a Python module designed to handle the complexities of Japanese word segmentation and part-of-speech (POS) tagging. This tool is ideal for developers and linguists who need an easy-to-use and efficient solution for processing Japanese language text with precision.

Key Features of Nagisa

Nagisa is built on recurrent neural networks and combines two techniques:

Word Segmentation: Utilizes both character-level and word-level features, enhancing its ability to accurately segment words within a text stream [池田+].
POS-Tagging: Implements tag dictionary information, aiding in high-quality tagging of parts of speech within a text [Inoue+].

Installation Requirements

Supported Python versions depend on the platform. On Linux, Nagisa works with Python 3.6 through 3.12. On macOS (Intel or M1/M2), it supports Python 3.9 through 3.12. On Windows, it installs on 64-bit Python 3.6, 3.7, or 3.8, and it is also compatible with the Windows Subsystem for Linux (WSL). Install it with pip:

pip install nagisa
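
Before installing, it can be handy to check whether the current interpreter falls inside the support matrix described above. The following is a minimal sketch, not part of Nagisa: the SUPPORTED table and the is_supported helper are hypothetical names, the version ranges are simply transcribed from the paragraph above, and the 64-bit requirement on Windows is not checked.

```python
import platform
import sys

# Support matrix transcribed from the text above, as (min, max) minor
# versions of Python 3. Keys are the values returned by platform.system().
SUPPORTED = {
    "Linux": (6, 12),    # WSL also reports "Linux"
    "Darwin": (9, 12),   # macOS, Intel or M1/M2
    "Windows": (6, 8),   # 64-bit only (bitness is not checked here)
}


def is_supported(system, version_info):
    """Return True if (system, Python version) falls inside the matrix."""
    if system not in SUPPORTED:
        return False
    major, minor = version_info[0], version_info[1]
    low, high = SUPPORTED[system]
    return major == 3 and low <= minor <= high


if __name__ == "__main__":
    ok = is_supported(platform.system(), sys.version_info)
    print("Interpreter is inside the documented support matrix:", ok)
```

A False result only means the combination is not listed above; it does not prove that installation will fail.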

Basic Usage

Nagisa simplifies the process of word segmentation and POS-tagging. Here's a basic example:

import nagisa

text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
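
Because the word list and the POS-tag list are parallel, they can be combined with ordinary Python tools. The sketch below uses the two lists printed above as literal data, so it runs without Nagisa; in real use they would come from words.words and words.postags. The variable names are illustrative only.

```python
from collections import Counter

# Lists copied from the example output above.
tokens = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

# Pair each word with its tag.
pairs = list(zip(tokens, postags))

# Count how often each POS tag occurs in the sentence.
tag_counts = Counter(postags)

for word, tag in pairs:
    print(f"{word}\t{tag}")
print(dict(tag_counts))
```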

Post-Processing Functions

Nagisa can also post-process the tagged output: nagisa.filter drops words that carry the given POS tags, and nagisa.extract keeps only words that carry them:

# Filter out the words with the specified POS tags.
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

# Extract only nouns.
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞

# List available POS-tags
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']
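
To make the difference between filtering and extracting concrete, here is a plain-Python sketch of the two selections applied to (word, tag) pairs. It is an illustration of the behavior shown in the outputs above, not Nagisa's implementation; filter_pairs and extract_pairs are hypothetical helpers.

```python
# (word, tag) pairs taken from the tagging output shown earlier.
pairs = [
    ('Python', '名詞'), ('で', '助詞'), ('簡単', '形状詞'), ('に', '助動詞'),
    ('使える', '動詞'), ('ツール', '名詞'), ('です', '助動詞'),
]


def filter_pairs(pairs, filter_postags):
    """Drop every pair whose tag is in filter_postags."""
    blocked = set(filter_postags)
    return [(w, t) for w, t in pairs if t not in blocked]


def extract_pairs(pairs, extract_postags):
    """Keep only the pairs whose tag is in extract_postags."""
    wanted = set(extract_postags)
    return [(w, t) for w, t in pairs if t in wanted]


filtered = filter_pairs(pairs, ['助詞', '助動詞'])
nouns = extract_pairs(pairs, ['名詞'])
print(' '.join(f'{w}/{t}' for w, t in filtered))
print(' '.join(f'{w}/{t}' for w, t in nouns))
```

The two helpers are complements of each other: for the same tag list, every pair ends up in exactly one of the two results.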

User Dictionary Customization

Users can add custom entries to the dictionary so that a phrase the default model splits into several tokens is recognized as a single word:

text = "3月に見た「3月のライオン」"
print(nagisa.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号
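
When the custom entries are maintained in a text file, they can be loaded into a list and passed to single_word_list. This is a minimal sketch under stated assumptions: the one-entry-per-line file layout and the helper names load_word_list and build_tagger are hypothetical; only nagisa.Tagger(single_word_list=...) comes from the example above.

```python
def load_word_list(path):
    """Read one entry per line, skipping blank lines and duplicates.

    The original order of first appearance is preserved.
    """
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word and word not in words:
                words.append(word)
    return words


def build_tagger(words):
    """Create a tagger that treats each entry in words as a single word."""
    # Imported lazily so load_word_list can be used without nagisa installed.
    import nagisa
    return nagisa.Tagger(single_word_list=words)
```

With a file such as user_words.txt (a hypothetical name), usage would look like build_tagger(load_word_list("user_words.txt")).tagging(text).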

Training a Custom Model

Nagisa also provides a training function for building a custom model for sequence-labeling tasks such as POS-tagging or Named Entity Recognition (NER). The dataset should be in TSV format: each line holds a word and its tag separated by a tab, and a line containing only EOS marks the end of each sentence:

$ cat sample.train
唯一	NOUN
の	ADP
趣味	NOUN
は	ADP
料理	NOUN
EOS
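
A file in this layout can be produced from in-memory data with a few lines of standard-library Python. The sketch below writes tagged sentences in the word, tab, tag format with an EOS line after each sentence, then reads them back as a sanity check. The helper names write_tagged_sentences and read_tagged_sentences are hypothetical and not part of Nagisa, and the file is written to a temporary directory for the sake of the example.

```python
import os
import tempfile


def write_tagged_sentences(path, sentences, delimiter="\t", eos="EOS"):
    """Write sentences as 'word<delimiter>tag' lines, each followed by an EOS line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for word, tag in sentence:
                f.write(f"{word}{delimiter}{tag}\n")
            f.write(f"{eos}\n")


def read_tagged_sentences(path, delimiter="\t", eos="EOS"):
    """Parse a file written by write_tagged_sentences back into lists of pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line == eos:
                sentences.append(current)
                current = []
            elif line:
                word, tag = line.split(delimiter)
                current.append((word, tag))
    return sentences


# The sentence from the sample above.
sample = [[("唯一", "NOUN"), ("の", "ADP"), ("趣味", "NOUN"),
           ("は", "ADP"), ("料理", "NOUN")]]

path = os.path.join(tempfile.mkdtemp(), "sample.train")
write_tagged_sentences(path, sample)

# Round-trip check: reading the file returns the original data.
assert read_tagged_sentences(path) == sample
```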

Train the model from Python by calling nagisa.fit with training, development, and test files in this format:

import nagisa

nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")

Conclusion

Nagisa covers Japanese word segmentation and POS-tagging in a single Python module. It installs with one pip command, exposes tagging, filtering, and extraction through a small API, and can be extended with a user dictionary or a custom-trained model, which makes it a practical choice for both developers and researchers working with Japanese text. For further details and updates, see the official documentation.