pytextclassifier - Comprehensive Python Library for Text Classification and Clustering

PyTextClassifier: An Overview

PyTextClassifier is an open-source Python toolkit for text classification and clustering. It targets applications such as sentiment analysis and text risk classification. It supports multiple classification algorithms, from classic machine learning models to deep neural networks, plus a clustering method, so it can be adapted to different text analysis needs.

Features

PyTextClassifier emphasizes a clear algorithm structure, high performance, and support for custom corpora. Its key functionality includes:

Classifier Algorithms:

Logistic Regression: Suitable for binary and multi-class classification.
Random Forest: An ensemble method using decision trees.
Decision Tree: A tree-like model for decision-making.
K-Nearest Neighbors (KNN): Classifies each sample by majority vote among its nearest neighbors.
Naive Bayes: A probabilistic classifier based on Bayes' theorem with a feature-independence assumption.
XGBoost: An optimized gradient boosting library.
Support Vector Machine (SVM): Finds a maximum-margin hyperplane that separates the classes.
TextCNN: A convolutional neural network for text.
TextRNN: A recurrent neural network for text.
FastText: Efficient text classification and representation.
BERT: A transformer-based model for a variety of tasks.
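
To make one of the classic entries above concrete, here is a minimal pure-Python sketch of KNN majority voting over bag-of-words vectors. It is an illustration only, not PyTextClassifier's implementation: the helper names (bag_of_words, cosine, knn_predict), the cosine similarity measure, and the toy training data are all assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens mapped to their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Label `text` by majority vote among its k most similar training texts."""
    query = bag_of_words(text)
    ranked = sorted(
        train,
        key=lambda pair: cosine(query, bag_of_words(pair[1])),
        reverse=True,
    )
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy (label, text) pairs, invented for this illustration.
train = [
    ("sports", "the team won the football match"),
    ("sports", "a late goal decided the match"),
    ("sports", "the coach praised the team"),
    ("education", "students sat the school exam"),
    ("education", "the school hired a new teacher"),
    ("education", "exam results improved for students"),
]

print(knn_predict(train, "the football team scored a goal"))
```

A production KNN would typically use TF-IDF weighting and an indexed neighbor search rather than sorting the whole training set for every query.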

Clustering Algorithm:

MiniBatchKMeans: A k-means variant that updates centroids from small random batches, which keeps it fast on large-scale datasets.
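
The idea behind mini-batch k-means can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the code PyTextClassifier runs: the function names, the optional init parameter, and the 2-D toy points are hypothetical, and real text clustering would operate on high-dimensional feature vectors.

```python
import random


def nearest(centers, point):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(centers[i], point)),
    )


def mini_batch_kmeans(points, k, batch_size=8, iterations=50, seed=0, init=None):
    """Cluster `points` by updating centers from small random batches."""
    rng = random.Random(seed)
    # Start from the given centers, or from k distinct random data points.
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    counts = [0] * k  # how many points each center has absorbed so far
    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then move the centers.
        assigned = [nearest(centers, p) for p in batch]
        for point, idx in zip(batch, assigned):
            counts[idx] += 1
            rate = 1.0 / counts[idx]  # per-center step size shrinks over time
            centers[idx] = [
                (1 - rate) * c + rate * x for c, x in zip(centers[idx], point)
            ]
    return centers


# Two well-separated toy clusters, invented for this illustration.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]
centers = mini_batch_kmeans(points, k=2, init=[(0.0, 0.0), (5.0, 5.0)])
print(centers)
```

Because each step touches only one batch, the cost per iteration does not grow with the dataset size, which is why the method suits large corpora.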

Installation

PyTextClassifier can be installed with pip or from a clone of the repository. Python 3.5 or above is required. Install PyTorch first, then the package:

pip3 install torch  # or conda install pytorch
pip3 install pytextclassifier

Alternatively, install from a git clone:

git clone https://github.com/shibing624/pytextclassifier.git
cd pytextclassifier
python3 setup.py install
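
After either installation route, a quick check like the following can confirm that the interpreter meets the stated minimum version and that the package is importable. The helper name environment_report is hypothetical; it relies only on the standard library.

```python
import importlib.util
import sys


def environment_report(package="pytextclassifier", minimum=(3, 5)):
    """Report whether Python is new enough and `package` can be found."""
    return {
        "python_ok": sys.version_info[:2] >= minimum,
        "installed": importlib.util.find_spec(package) is not None,
    }


print(environment_report())
```

If "installed" is False, the package is not visible to the interpreter running the check, which often means pip installed it into a different environment.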

Usage

PyTextClassifier handles both English and Chinese text classification, and covers model training, evaluation, and prediction.

English Text Classification

For English text classification, the repository includes an example file that trains and evaluates a Logistic Regression model. The essential steps are:

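from pytextclassifier import ClassicClassifier

data = [  # illustrative training data; (label, text) pair format is assumed
    ('education', 'Students sat the national school exam'),
    ('education', 'The school hired a new teacher'),
    ('sports', 'The team won the football match'),
    ('sports', 'A late goal decided the final'),
]
m = ClassicClassifier(output_dir='models/lr', model_name_or_model='lr')
m.train(data)  # Train on the data and save the model under output_dir
m.load_model()  # Load the trained model
predict_label, predict_proba = m.predict(['Sample text for prediction'])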
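
The snippet above stops at prediction. One simple way to evaluate the result is to compare predicted labels with known labels on held-out texts. The sketch below uses a hypothetical accuracy helper written in plain Python, with hard-coded labels standing in for real model output; it does not use any PyTextClassifier evaluation API.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Hypothetical held-out labels; in practice `predicted` would be the
# predict_label list returned by m.predict(texts).
gold = ["sports", "education", "sports", "education"]
predicted = ["sports", "education", "education", "education"]
print(accuracy(gold, predicted))  # 3 of 4 correct -> 0.75
```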

Chinese Text Classification

The toolkit also supports Chinese text classification. It works the same way as the English workflow, and example files are provided for guidance.

Advanced Features

Beyond basic classification, PyTextClassifier also offers:

Visual Feature Importance: Helps visualize feature weights and prediction importance for interpretability.
Deep Learning Models: Includes support for models like FastText, TextCNN, TextRNN, and BERT, which enable handling complex classification tasks.
Multi-label and Multi-level Classification: Supports tasks where instances can have multiple labels or hierarchical label structures.
ONNX Export: Allows exporting models to ONNX for more efficient inference and integration with other platforms.
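
To illustrate the multi-label item above: when an instance can carry several labels, the targets are commonly encoded as multi-hot vectors, one 0/1 column per known label. The sketch below shows that encoding in plain Python. The helper names and sample labels are assumptions for illustration, and the toolkit's own input format for multi-label data may differ.

```python
def fit_label_index(label_sets):
    """Map every label seen in the data to a fixed column position."""
    labels = sorted({label for labels in label_sets for label in labels})
    return {label: i for i, label in enumerate(labels)}


def to_multi_hot(labels, index):
    """Encode one instance's labels as a 0/1 vector over all known labels."""
    vector = [0] * len(index)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Each inner list holds the labels of one document (invented examples).
label_sets = [["politics", "economy"], ["sports"], ["economy", "sports"]]
index = fit_label_index(label_sets)
encoded = [to_multi_hot(labels, index) for labels in label_sets]
print(index)
print(encoded)
```

A multi-label model then predicts one independent probability per column instead of a single class.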

Conclusion

PyTextClassifier pairs a simple interface with advanced capabilities for text classification. For research or production use, it covers classic machine learning models, deep learning models, and clustering, which makes it applicable to a wide range of text analysis scenarios.