# Hazm - Persian NLP Toolkit

## Introduction
Hazm is a Python library for natural language processing (NLP) of Persian text. It provides a broad set of tools to analyze, process, and understand Persian: users can normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, extract dependency relations, build word and sentence embeddings, and read popular Persian corpora.

## Features

- Normalization: standardizes text by removing diacritics, correcting spacing, and unifying characters.
- Tokenization: splits text into sentences and words.
- Lemmatization: reduces words to their base forms, which is crucial for understanding word usage in context.
- POS (Part of Speech) Tagging: assigns grammatical tags such as noun, verb, and adjective to words to denote their role in the text.
- Dependency Parsing: identifies syntactic relations between words, helping to understand sentence structure.
- Embedding: provides vector representations of words and sentences for computational text analysis.
- Persian Corpora Reading: offers easy access to well-known Persian text datasets, streamlining research and development.

## Installation

To install the latest release of Hazm, run the following command in your terminal:

```bash
pip install hazm
```

To install the latest development version from GitHub, which may be less stable, use:

```bash
pip install git+https://github.com/roshan-research/hazm.git
```

## Pretrained Models

To spare you training from scratch, Hazm provides several pretrained models that can be downloaded (a loading sketch follows the list):
- WordEmbedding (~5 GB)
- SentEmbedding (~1 GB)
- POSTagger (~18 MB)
- DependencyParser (~15 MB)
- Chunker (~4 MB)
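
A minimal loading sketch, assuming `pos_tagger.model` and `chunker.model` from the list above have been downloaded into the working directory; the expected output may vary with the model version:

```python
from hazm import Chunker, POSTagger, tree2brackets, word_tokenize

# Assumes pos_tagger.model and chunker.model (from the list above)
# sit in the current working directory.
tagger = POSTagger(model='pos_tagger.model')
chunker = Chunker(model='chunker.model')

# "We like reading books"
tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
print(tree2brackets(chunker.parse(tagged)))
# Example output (may vary by model version):
# [کتاب خواندن NP] [را POSTP] [دوست داریم VP]
```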

## Usage

Here's a quick demonstration of how to use Hazm:
```python
from hazm import *

# Text normalization: fixes Arabic characters and inserts half-spaces (ZWNJ)
normalizer = Normalizer()
print(normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند'))
# Output: اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند

# Sentence and word tokenization
sentences = sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
words = word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')

# Word lemmatization: reduces an inflected verb to its past/present stems
lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize('می‌روم'))
# Output: رفت#رو

# POS tagging (requires the pretrained POSTagger model listed above)
tagger = POSTagger(model='pos_tagger.model')
tags = tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))

# Dependency parsing (reuses the tagger and lemmatizer)
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
print(parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟')))
```
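
Embedding is listed among the features but not shown above. Here is a minimal sketch, assuming the pretrained WordEmbedding model (~5 GB) has been downloaded and extracted as `word2vec.bin`; the constructor arguments follow the recent Hazm API, so check the documentation for your version:

```python
from hazm import WordEmbedding

# Assumes the pretrained word-embedding model from the list above
# has been downloaded and extracted as 'word2vec.bin'.
word_embedding = WordEmbedding(model_type='fasttext', model_path='word2vec.bin')

# Find the odd word out among three greetings and 'پنجره' ("window")
print(word_embedding.doesnt_match(['سلام', 'درود', 'خداحافظ', 'پنجره']))
# Expected output: پنجره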
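```

Similarly, the corpus readers stream the Persian corpora mentioned in the feature list. A rough sketch, assuming the Peykare corpus has been obtained separately and extracted under `./peykare` (the reader name and argument are taken from the Hazm docs; the path is illustrative):

```python
from hazm import PeykareReader

# Assumes the Peykare corpus was obtained separately and extracted
# under ./peykare (illustrative path).
peykare = PeykareReader(root='peykare')

# Corpus readers are lazy iterators; grab the first tagged sentence.
print(next(peykare.sents()))
```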

## Documentation

For comprehensive details on features and usage, visit the Hazm documentation.

## Hazm in Other Languages

Hazm has been ported to other programming languages, although these ports are not maintained by the original developers:

- JHazm: a Java port of Hazm
- NHazm: a C# port of Hazm

## Contribution

The Hazm project welcomes contributions, including bug reports and feature requests. To contribute, consult the contribution guidelines, fork the repository, make your changes, and submit a pull request.

## Acknowledgements

Special thanks to the Virastyar project for contributing a Persian word list that enhances the Hazm toolkit.

For further engagement and updates, follow Roshan AI on Twitter.