# Hazm - Persian NLP Toolkit

## Introduction
Hazm is a Python library for natural language processing (NLP) of Persian text. It provides a broad set of tools to analyze, process, and understand Persian: users can normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, extract dependency relations, build word and sentence embeddings, and read popular Persian corpora.

## Features

- Normalization: standardizes text by removing diacritics, correcting spacing, and unifying characters.
- Tokenization: splits text into sentences and words.
- Lemmatization: reduces words to their base forms, which is crucial for understanding word usage in context.
- POS (Part of Speech) Tagging: assigns grammatical tags such as noun, verb, and adjective to words to denote their role in the text.
- Dependency Parsing: identifies syntactic relations between words, helping to understand sentence structure.
- Embedding: provides vector representations of words and sentences for computational text analysis.
- Persian Corpora Reading: offers easy access to well-known Persian text datasets, streamlining research and development.

## Installation

To install the latest release of Hazm, run the following command in your terminal:

```bash
pip install hazm
```

To install the latest development version from GitHub, which may be less stable, use:

```bash
pip install git+https://github.com/roshan-research/hazm.git
```

## Pretrained Models

To spare you training from scratch, Hazm provides several pretrained models that can be downloaded (a loading sketch follows the list):
- WordEmbedding (~5 GB)
- SentEmbedding (~1 GB)
- POSTagger (~18 MB)
- DependencyParser (~15 MB)
- Chunker (~4 MB)
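
A minimal loading sketch, assuming `pos_tagger.model` and `chunker.model` from the list above have been downloaded into the working directory; the expected output may vary with the model version:

```python
from hazm import Chunker, POSTagger, tree2brackets, word_tokenize

# Assumes pos_tagger.model and chunker.model (from the list above)
# sit in the current working directory.
tagger = POSTagger(model='pos_tagger.model')
chunker = Chunker(model='chunker.model')

# "We like reading books"
tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
print(tree2brackets(chunker.parse(tagged)))
# Example output (may vary by model version):
# [کتاب خواندن NP] [را POSTP] [دوست داریم VP]
```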

## Usage

Here's a quick demonstration of how to use Hazm:
```python
from hazm import *

# Text normalization: fixes Arabic characters and inserts half-spaces (ZWNJ)
normalizer = Normalizer()
print(normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند'))
# Output: اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند

# Sentence and word tokenization
sentences = sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
words = word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')

# Word lemmatization: reduces an inflected verb to its past/present stems
lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize('می‌روم'))
# Output: رفت#رو

# POS tagging (requires the pretrained POSTagger model listed above)
tagger = POSTagger(model='pos_tagger.model')
tags = tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))

# Dependency parsing (reuses the tagger and lemmatizer)
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
print(parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟')))
```
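
Embedding is listed among the features but not shown above. Here is a minimal sketch, assuming the pretrained WordEmbedding model (~5 GB) has been downloaded and extracted as `word2vec.bin`; the constructor arguments follow the recent Hazm API, so check the documentation for your version:

```python
from hazm import WordEmbedding

# Assumes the pretrained word-embedding model from the list above
# has been downloaded and extracted as 'word2vec.bin'.
word_embedding = WordEmbedding(model_type='fasttext', model_path='word2vec.bin')

# Find the odd word out among three greetings and 'پنجره' ("window")
print(word_embedding.doesnt_match(['سلام', 'درود', 'خداحافظ', 'پنجره']))
# Expected output: پنجره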
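```

Similarly, the corpus readers stream the Persian corpora mentioned in the feature list. A rough sketch, assuming the Peykare corpus has been obtained separately and extracted under `./peykare` (the reader name and argument are taken from the Hazm docs; the path is illustrative):

```python
from hazm import PeykareReader

# Assumes the Peykare corpus was obtained separately and extracted
# under ./peykare (illustrative path).
peykare = PeykareReader(root='peykare')

# Corpus readers are lazy iterators; grab the first tagged sentence.
print(next(peykare.sents()))
```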

## Documentation

For comprehensive details on features and usage, visit the Hazm documentation.

## Hazm in Other Languages

Hazm has been ported to other programming languages, although these ports are not maintained by the original developers:

- JHazm: a Java port of Hazm
- NHazm: a C# port of Hazm

## Contribution

The Hazm project welcomes contributions, including bug reports and feature requests. To contribute, consult the contribution guidelines, fork the repository, make your changes, and submit a pull request.

## Acknowledgements

Special thanks to the Virastyar project for contributing a Persian word list that enhances the Hazm toolkit.

For further engagement and updates, follow Roshan AI on Twitter.