Introduction to Jcseg
Jcseg is a lightweight Chinese word segmentation tool built on the mmseg algorithm. Beyond segmentation, it provides keyword extraction, key phrase extraction, key sentence extraction, and automatic article summarization. It ships with a Jetty-based web server, so any programming language can call it over HTTP. Jcseg also provides segmentation interfaces for the latest versions of Lucene, Solr, and Elasticsearch/OpenSearch. A configuration file, jcseg.properties, allows quick setup for different scenarios, such as setting the maximum match word length, enabling Chinese name recognition, and appending synonyms or pinyin.
Core Features of Jcseg
- Chinese Segmentation: Uses the mmseg algorithm together with Jcseg's own optimizations, offering seven segmentation modes.
- Keyword Extraction: Based on the TextRank algorithm.
- Key Phrase Extraction: Also utilizes the TextRank algorithm.
- Key Sentence Extraction: Another feature powered by the TextRank algorithm.
- Automatic Article Summarization: Combines the BM25 and TextRank algorithms.
- Part-of-Speech Tagging: Uses a lexicon and rudimentary statistical disambiguation, so it is not recommended where high accuracy is essential.
- Named Entity Recognition: Identifies various entities such as email addresses, URLs, mobile numbers, locations, personal names, currency, and more using a lexicon and statistical disambiguation.
- RESTful API: Embedded Jetty server offers high-performance HTTP interfaces with standardized JSON output for easy access across different languages.
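Several of the features above rely on TextRank. The following is a minimal Python sketch of TextRank-style keyword ranking over a co-occurrence graph, under stated assumptions: it is not Jcseg's implementation, and the function name `textrank_keywords`, its parameters, and the sample tokens are hypothetical. It assumes the text has already been segmented into tokens.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=30):
    """Rank tokens with a TextRank-style iteration over a co-occurrence graph."""
    # Link every pair of distinct tokens that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    # Each node repeatedly collects an equal share of its neighbors' scores.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * rank
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical pre-segmented input in which one token links to all others.
tokens = ["分词", "算法", "分词", "工具", "分词", "词典"]
print(textrank_keywords(tokens, top_k=1, window=1))
```

In this toy input the token that co-occurs with every other token ends up with the highest score, which is the intuition behind using graph centrality to pick keywords.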
Chinese Word Segmentation in Jcseg
Segmentation Modes
- Simple Mode: Uses the fast FMM (forward maximum matching) algorithm; suited to speed-critical scenarios.
- Complex Mode: Resolves ambiguity well, with a reported segmentation accuracy of 98.41%.
- Detect Mode: Returns only existing lexicon entries, suitable for particular applications.
- Most Mode: Fine-grained segmentation designed for retrieval.
- Delimiter Mode: Splits text on a given delimiter character (a space by default), useful for specific applications.
- NLP Mode: An extension of the complex mode with specialized recognition extensions for emails, mobile numbers, URLs, etc.
- N-gram Mode: A universal n-gram segmentation for CJK and Latin characters.
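The Simple Mode entry above names the FMM algorithm. A minimal Python sketch of forward maximum matching follows; it illustrates the general technique rather than Jcseg's own code, and the function name `fmm_segment`, the `max_len` parameter, and the toy lexicon are assumptions made for illustration.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, hypothetical and far smaller than a real dictionary.
lexicon = {"研究", "研究生", "生命", "起源"}
print(fmm_segment("研究生命起源", lexicon))  # ['研究生', '命', '起源']
```

The greedy scan takes 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. Ambiguities of this kind are what a disambiguating mode such as Complex Mode is meant to handle, at some cost in speed.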
Features of Segmentation
- Supports custom lexicons with directory-based management.
- Allows the mixing of simplified and traditional Chinese in lexicons.
- Can append synonyms and pinyin to segmentation results for enhanced text analysis and retrieval.
- Recognizes Chinese numbers and fractions and converts them into Arabic numerals.
- Supports mixed-language word recognition and segmentation, including complex English terms.
- Provides extra features like intelligent segmentation of special characters and punctuation, stop word filtering, and automatic reloading of updated lexicons.
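The conversion of Chinese numbers and fractions to Arabic numerals listed above can be sketched as follows. This is a simplified illustration, not Jcseg's implementation: the names `chinese_to_int` and `chinese_fraction` are hypothetical, the `n/m` output format for fractions is an assumption, and compound section units such as 万亿 are deliberately not handled.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}
SECTIONS = {"万": 10_000, "亿": 100_000_000}


def chinese_to_int(numeral):
    """Convert a Chinese numeral such as 三千二百万 to an int."""
    total = 0    # sum of completed 万 / 亿 sections
    section = 0  # value accumulated inside the current section
    digit = 0    # pending digit that has not met its unit yet
    for ch in numeral:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit, like the leading 十 in 十五, implies a digit of 1.
            section += (digit or 1) * UNITS[ch]
            digit = 0
        elif ch in SECTIONS:
            total += (section + digit) * SECTIONS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def chinese_fraction(text):
    """Convert a fraction such as 五分之三 (denominator first) to 'n/m' form."""
    denominator, numerator = text.split("分之")
    return f"{chinese_to_int(numerator)}/{chinese_to_int(denominator)}"


print(chinese_to_int("一百零五"))    # 105
print(chinese_fraction("五分之三"))  # 3/5
```

Normalizing numerals this way lets a query written with Arabic digits match text written with Chinese numerals, which is presumably why the feature matters for retrieval.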
Quick Experience of Jcseg
Users can test Jcseg from the terminal by compiling the code and running the Jcseg core jar file. They can experiment with different segmentation algorithms by entering specific commands, such as :seg_mode, :keywords, :keyphrase, etc., in the terminal.
Jcseg Maven Repository
To include Jcseg in a Maven project, add the modules you need (jcseg-core, jcseg-analyzer, jcseg-elasticsearch, jcseg-opensearch, or jcseg-server) as dependencies in the pom.xml file, along with the repository details.
Jcseg Lucene and Solr Interfaces
Lucene Interface:
- Import the Jcseg core and analyzer jar files.
- In the segmenting code, instantiate a Jcseg analyzer and customize its configuration, for example to append synonyms and pinyin.
Solr Interface:
- Copy the necessary jar files to the Solr library directory.
- Update schema.xml with the appropriate field type configurations, specifying the segmentation mode needed (complex, simple, detect, or search).
With its rich feature set and flexible configuration, Jcseg supports advanced text analysis for Chinese language applications and integrates with popular search and indexing platforms.