HanLP: Han Language Processing
HanLP is a multilingual natural language processing (NLP) toolkit designed for production environments. It is built on PyTorch and TensorFlow 2.x and aims to bring state-of-the-art NLP techniques into production use. Its design goals are comprehensive functionality, high precision, efficient performance, up-to-date corpora, a clear architecture, and customizability.
Key Features
HanLP supports 130 languages, including Chinese (simplified and traditional), English, Japanese, Russian, French, and German. It offers ten joint tasks and numerous single tasks, with regularly updated pre-trained models for dozens of tasks.
The project offers two APIs, one for lightweight applications and one for large-scale work:
- RESTful API: Suited to agile development and mobile apps. It is easy to use, quick to install, and offers high-precision models without requiring a GPU setup.
- Native API: Best suited to professional NLP engineers, researchers, and large-data scenarios. It requires Python 3.6 to 3.10 and runs on CPU, although a GPU or TPU is recommended.
Core Functions
HanLP comes packed with several core language processing functions such as:
- Tokenization: Divides text into tokens.
- Part-of-Speech Tagging (POS): Tags each word with its corresponding part of speech.
- Named Entity Recognition (NER): Identifies and categorizes key entities within the text.
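- Dependency Parsing: Analyzes the grammatical structure of a sentence as head-dependent relations between words.
- Semantic Dependency Parsing (SDP): Analyzes the semantic relations between the words of a sentence.
- Semantic Role Labeling (SRL): Identifies predicates and their arguments, such as who did what to whom.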
- Abstract Meaning Representation (AMR): Converts text to a semantic representation.
- Coreference Resolution: Identifies when two or more expressions refer to the same entity.
- Semantic Textual Similarity (STS): Measures the similarity of different texts.
- Text Style Transfer: Converts text from one style to another.
- Keyword and Keyphrase Extraction: Extracts significant words or phrases.
- Automatic Summarization: Generates a summary of the text, either extractive or abstractive.
- Grammar Error Correction: Identifies and corrects grammatical errors.
- Text Classification: Sorts texts into predefined categories.
- Sentiment Analysis: Determines the sentiment of the text.
How to Use
RESTful API
HanLP's RESTful API has clients for several programming languages, such as Python, Go, and Java. A few lines of code give access to HanLP's features. Here is a quick introduction to getting started with Python:
- Install the RESTful client:
pip install hanlp_restful
- Create a client with the server address and key:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')  # auth=None means anonymous use
- Parse the text:
HanLP.parse("2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。")
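When only some annotations are needed, it may help to restrict the request to specific tasks. The following is a minimal sketch, not an official recipe: the helper name parse_selected is hypothetical, and it assumes that the client's parse method accepts a tasks argument and that 'ner' is a valid task name on the server you use. Calling it requires the hanlp_restful package and network access.

```python
def parse_selected(text, tasks="ner", url="https://www.hanlp.com/api",
                   auth=None, language="zh"):
    """Hypothetical helper: parse `text`, requesting only the given task(s).

    Assumes HanLPClient.parse accepts a `tasks` argument and that the
    task name passed in is known to the server.
    """
    # Imported lazily so that defining this helper does not itself
    # require the hanlp_restful package to be installed.
    from hanlp_restful import HanLPClient

    client = HanLPClient(url, auth=auth, language=language)
    return client.parse(text, tasks=tasks)
```

A call such as parse_selected("阿婆主来到北京立方庭参观自然语义科技公司。") would then send one request with anonymous access. Keeping the server address and key as parameters makes it easy to switch to an authenticated key later.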
Native API
For those with deeper technical needs, HanLP's native API provides a robust solution using deep learning engines like PyTorch or TensorFlow. By loading a model and passing sentences through it, users can obtain structured textual data:
import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
results = HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'])
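To make use of the structured result, it can be handy to line up each token with its tag. The sketch below assumes the result behaves like a dictionary that maps task names to one list per input sentence, and that the keys "tok/fine" and "pos/ctb" are present for this model. Both the keys and the sample data are assumptions for illustration; the sample is hand-written and is not actual model output.

```python
# Hand-written sample mimicking the assumed shape of a result:
# a dict from task name to one list per sentence. Not real model output.
sample = {
    "tok/fine": [["阿婆主", "来到", "北京", "。"]],
    "pos/ctb": [["NN", "VV", "NR", "PU"]],
}


def pair_tokens_with_tags(doc, tok_key="tok/fine", pos_key="pos/ctb"):
    """Return, for each sentence, a list of (token, tag) pairs."""
    paired = []
    for tokens, tags in zip(doc[tok_key], doc[pos_key]):
        # Guard against misaligned annotations instead of silently truncating.
        if len(tokens) != len(tags):
            raise ValueError("token and tag sequences differ in length")
        paired.append(list(zip(tokens, tags)))
    return paired


pairs = pair_tokens_with_tags(sample)
print(pairs[0])
```

With a real result, the same helper could be called as pair_tokens_with_tags(results), adjusting the two key names to whatever the loaded model actually returns.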
Conclusion
HanLP positions itself as a highly adaptable and powerful tool in the NLP space, catering to both everyday application needs and complex, large-scale data processing scenarios. Its impressive breadth of features and flexibility across different programming environments make it an invaluable resource for developers and researchers alike.