Kiwi: Korean Intelligent Word Identifier
Kiwi is a Korean morphological analyzer designed for fast, versatile processing of Korean text. It is an open-source library, and everyone from developers to linguists is welcome to use it and contribute to its development. The core library is written in C++, and bindings are available for several other programming languages.
Key Features
Kiwi performs morphological analysis based on the Sejong part-of-speech tagging system, and its models are trained on the Sejong and Modu corpora. It reaches about 87% accuracy on web texts and up to 94% on literary texts when analyzing Korean sentences. Since version 0.13.0, Kiwi can also automatically correct minor typos during analysis.
The library is fast, outperforming many other Korean morphological analyzers. Its built-in lightweight language models let it resolve ambiguous analyses accurately, especially when the SBG model is used. Kiwi also provides convenience features such as sentence splitting.
Multithreading is supported at the library level, so large volumes of text can be processed efficiently across multiple cores. Kiwi offers small, medium, and large models, letting users choose the size that fits their computational resources and accuracy requirements.
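As an illustration of the features above, here is a minimal sketch using the Python wrapper, Kiwipiepy. It assumes the package is installed (pip install kiwipiepy) and relies on its Kiwi class with the tokenize and split_into_sents methods. The helper names format_tokens and analyze, and the DemoToken stand-in, are hypothetical and exist only for this example. The import is deferred so the formatting helper can be tried without the library.

```python
from collections import namedtuple


def format_tokens(tokens):
    """Render tokens as 'form/TAG + form/TAG'.

    Works on any objects that expose .form and .tag attributes.
    """
    return " + ".join(f"{t.form}/{t.tag}" for t in tokens)


def analyze(text, num_workers=None):
    """Tokenize `text` and split it into sentences with Kiwipiepy.

    Assumption: kiwipiepy is installed. `num_workers` is passed through
    to the Kiwi constructor to set the number of worker threads.
    """
    from kiwipiepy import Kiwi  # deferred import; only needed here

    kiwi = Kiwi() if num_workers is None else Kiwi(num_workers=num_workers)
    tokens = kiwi.tokenize(text)
    sentences = [s.text for s in kiwi.split_into_sents(text)]
    return format_tokens(tokens), sentences


# Hand-written stand-in tokens for demonstration; this is NOT Kiwi output.
DemoToken = namedtuple("DemoToken", ["form", "tag"])
demo = [DemoToken("학교", "NNG"), DemoToken("에", "JKB")]
print(format_tokens(demo))  # 학교/NNG + 에/JKB
```

With Kiwipiepy installed, calling analyze on a Korean string should return the formatted morphemes together with a list of sentence strings. The exact tags depend on the model in use.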
Installation and Use
C++ Library
Kiwi can be compiled and used on Windows, Linux, and macOS. Precompiled binaries are available for download from the Kiwi GitHub releases page. Detailed instructions are provided for compiling the library with tools such as Visual Studio on Windows or CMake on Linux.
Other Programmatic Interfaces
- C API: Refer to include/kiwi/capi.h for implementation details.
- C# Wrapper: A GUI application is available in C#, suitable for users without programming skills. It can be downloaded from the Kiwi GUI GitHub page.
- Python3 Wrapper: Known as Kiwipiepy, this API can be found on GitHub.
- Java Wrapper: KiwiJava provides Java bindings for Java 1.8 and above.
- R Wrapper: Developed by mrchypark, the R wrapper is named elbird.
- Go Wrapper: Created by the codingpot community, the Go language wrapper is called kiwigo.
- WebAssembly: A WebAssembly binding for JavaScript/TypeScript has been contributed by RicBent.
Performance
Kiwi is built with a focus on performance and adaptability. Its three model sizes (small, medium, and large) let users balance accuracy against the computational resources they have available. Accuracy varies by domain, with results reported for web text, news articles, and literature.
Understanding Part-of-Speech Tags
Kiwi uses an extended version of the Sejong POS tag set, with some tags added or modified. The tags classify each morpheme by part of speech, such as nouns, verbs, adjectives, and the various kinds of particles.
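To make the tag categories concrete, the sketch below maps standard Sejong tags to coarse part-of-speech groups by prefix. It is not part of Kiwi: the table SEJONG_PREFIXES and the function coarse_category are hypothetical names for this example, and the grouping covers only the base Sejong tags. Tags that Kiwi adds or modifies fall through to "other" unless they share a Sejong prefix, and the handling of a hyphenated suffix on a tag is an assumption.

```python
# Coarse categories keyed by Sejong tag prefix (illustrative grouping only).
SEJONG_PREFIXES = [
    ("NN", "noun"),          # NNG, NNP, NNB
    ("NR", "numeral"),
    ("NP", "pronoun"),
    ("VV", "verb"),
    ("VA", "adjective"),
    ("VX", "auxiliary predicate"),
    ("VC", "copula"),        # VCP, VCN
    ("MM", "determiner"),
    ("MA", "adverb"),        # MAG, MAJ
    ("IC", "interjection"),
    ("J", "particle"),       # JKS, JKO, JX, JC, ...
    ("E", "ending"),         # EP, EF, EC, ETN, ETM
    ("XP", "prefix"),
    ("XS", "suffix"),
    ("XR", "root"),
    ("S", "symbol/foreign/number"),  # SF, SP, SL, SH, SN, ...
]


def coarse_category(tag):
    """Return a coarse part-of-speech group for a Sejong-style tag."""
    # Assumption: a hyphenated suffix (e.g. "VV-I") refines the base tag,
    # so only the part before the hyphen is used for grouping.
    base = tag.split("-")[0]
    for prefix, name in SEJONG_PREFIXES:
        if base.startswith(prefix):
            return name
    return "other"


for tag in ["NNG", "VV", "JKS", "EF"]:
    print(tag, "->", coarse_category(tag))
```

A prefix table keeps the mapping short, at the cost of precision: the "S" group, for instance, lumps punctuation together with foreign letters and digits.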
Continuous Improvement and Support
Kiwi is continually updated with contributions from its users and developers. Community interest and involvement are crucial for its ongoing development and success. Users are encouraged to check the release notes for the latest updates and improvements.
Kiwi combines capable language processing, accessibility for users of varying backgrounds, and an open-source collaborative model, making it a useful tool for anyone working with the Korean language.