pycorrector - Comprehensive Toolkit for Chinese Text Correction Using Various Models

Introduction to Pycorrector

Pycorrector is a Python toolkit for Chinese text correction. It corrects several types of errors, including phonetic-similarity, visual-similarity, and grammatical mistakes. The toolkit is developed with Python 3.8 and integrates models such as Kenlm, ConvSeq2Seq, BERT, MacBERT, ELECTRA, ERNIE, and GPT-style models to improve correction performance.

Common Error Types in Chinese Text

Chinese text correction involves addressing several common types of errors. These include:

Phonetic Similarity Errors: Mistakes caused by similar-sounding words, often introduced by pinyin input methods (for example, 眼睛 typed in place of 眼镜).
Visual Similarity Errors: Mistakes caused by characters that look alike, common with Wubi input and Optical Character Recognition (OCR) (for example, 高梁 in place of 高粱).
Grammatical Errors: Incorrect sentence structures and word orders.

Different applications prioritize different error types: a search engine may need to handle all of them, while a speech recognition pipeline is mostly concerned with phonetic errors.

Key Features of Pycorrector

Kenlm Model: Trains a Chinese NGram language model using the Kenlm statistical language model toolkit. It combines rules and confusion sets to correct spelling errors quickly.
DeepContext Model: A text correction model implemented in PyTorch, inspired by Stanford's NLC model, with moderate effectiveness.
Seq2Seq Model: Utilizes ConvSeq2Seq architecture implemented in PyTorch, achieving notable results in the NLPCC-2018 grammar correction contest.
T5 Model: Uses the pre-trained Langboat/mengzi-t5-base model, fine-tuned on Chinese text correction datasets.
ERNIE_CSC Model: Built with PaddlePaddle and fine-tuned from ERNIE-1.0 specifically for Chinese spelling correction.
MacBERT Model: Recommended for Chinese spelling correction, this PyTorch-implemented model incorporates error detection and correction networks.
GPT Models: Including ChatGLM and LLaMA, these models are fine-tuned on Chinese spelling correction (CSC) and grammar correction datasets and show strong correction performance.
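
The idea behind the Kenlm approach listed above, generating candidates from a confusion set and keeping the ones a language model prefers, can be sketched in a few lines. This is a toy illustration only, not pycorrector's implementation: the three-sentence corpus, the confusion set, and the bigram-count score are all made-up stand-ins for a trained Kenlm model and the project's real confusion data.

```python
import math
from collections import Counter

# Hypothetical toy data: a real system would use a trained n-gram model
# and a much larger confusion set.
CORPUS = ["少先队员应该为老人让座", "我们应该努力学习", "请给老人让座"]
CONFUSION = {"因": ["应"], "坐": ["座"]}


def bigram_counts(corpus):
    """Count character bigrams, with ^ and $ marking sentence boundaries."""
    counts = Counter()
    for sentence in corpus:
        padded = "^" + sentence + "$"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts


def score(sentence, counts):
    """Crude plausibility score: sum of log(count + 1) over character bigrams."""
    padded = "^" + sentence + "$"
    return sum(math.log(counts[padded[i:i + 2]] + 1) for i in range(len(padded) - 1))


def correct(sentence, counts, confusion):
    """Greedily replace a character when a confusion candidate scores higher."""
    chars = list(sentence)
    errors = []
    for i, ch in enumerate(chars):
        best, best_score = ch, score("".join(chars), counts)
        for cand in confusion.get(ch, []):
            trial = "".join(chars[:i] + [cand] + chars[i + 1:])
            trial_score = score(trial, counts)
            if trial_score > best_score:
                best, best_score = cand, trial_score
        if best != ch:
            errors.append((ch, best, i))
            chars[i] = best
    return "".join(chars), errors


COUNTS = bigram_counts(CORPUS)
fixed, found = correct("少先队员因该为老人让坐", COUNTS, CONFUSION)
print(fixed, found)
```

The greedy left-to-right search keeps the sketch short; a practical corrector would also need word segmentation, a detection threshold to avoid over-correcting, and a properly smoothed language model.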

Model Evaluation

Pycorrector's models are evaluated on two kinds of benchmarks, CSC (Chinese Spelling Correction) and CTC (Chinese Text Correction), using datasets such as SIGHAN-2015, EC-LAW, and MCSC. Results are reported as F1 scores under a strict sentence-level definition, in which a prediction counts as correct only if the entire corrected sentence matches the reference.
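
To make the sentence-level metric concrete, here is a minimal sketch of one common way to compute precision, recall, and F1 at sentence granularity. The counting convention (sentences that need no change count as negatives, and any altered output on them is a false positive) is an assumption for illustration and may differ in detail from the project's evaluation scripts; the function name and sample data are hypothetical.

```python
def sentence_level_prf(triples):
    """Compute (precision, recall, f1) from (source, target, prediction) triples.

    A sentence is a positive example when source != target, i.e. it contains
    at least one error. A prediction is correct only on an exact match with
    the target sentence.
    """
    tp = fp = fn = 0
    for source, target, prediction in triples:
        if source == target:
            # Nothing to fix: changing the sentence is a false positive.
            if prediction != target:
                fp += 1
        elif prediction == target:
            tp += 1  # erroneous sentence fully corrected
        else:
            fn += 1  # erroneous sentence left wrong or only partly fixed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up sample: two full corrections, one miss, one clean sentence kept,
# one clean sentence wrongly altered.
SAMPLE = [
    ("因该", "应该", "应该"),
    ("让坐", "让座", "让坐"),
    ("你好", "你好", "你好"),
    ("天气", "天气", "天汽"),
    ("高心", "高兴", "高兴"),
]
precision, recall, f1 = sentence_level_prf(SAMPLE)
print(precision, recall, f1)
```

Because a sentence with several errors earns credit only when every one of them is fixed, this metric is stricter than character-level scoring.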

Latest Developments

Version 1.1.0: Introduced models based on Qwen2.5 that cover a wider range of errors, including multi-character errors, missing characters, and grammatical errors. The available models are shibing624/chinese-text-correction-1.5b and shibing624/chinese-text-correction-7b.
Version 1.0.0: Added ChatGLM3/LLaMA2 GPT models and restructured implementations for DeepContext, ConvSeq2Seq, and T5 models, further enhancing text correction methods.

Installation and Usage

Installing pycorrector is straightforward using pip:

pip install -U pycorrector

The project encourages experimentation with various models by providing pre-trained models for quick predictions and allowing custom training with user data.
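
As a rough sketch of what a quick prediction might look like after installation: the snippet below assumes that the package exposes a `Corrector` class with a `correct_batch` method returning dictionaries with `source`, `target`, and `errors` keys. That interface is an assumption here and should be checked against the installed version's documentation. The `describe` and `run_demo` helpers are hypothetical names introduced only for this example.

```python
def describe(result):
    """Render one result dict (assumed keys: source, target, errors) as text."""
    lines = [f"{result['source']} -> {result['target']}"]
    for wrong, right, pos in result.get("errors", []):
        lines.append(f"  position {pos}: {wrong} -> {right}")
    return "\n".join(lines)


def run_demo(sentences):
    """Correct sentences with pycorrector and return printable summaries."""
    # Imported lazily so this file loads even without pycorrector installed.
    # Assumed entry point; verify the name against the installed version.
    from pycorrector import Corrector

    corrector = Corrector()  # may download model files on first use
    return [describe(r) for r in corrector.correct_batch(sentences)]


# Example call (requires pycorrector and its model files):
# for summary in run_demo(["少先队员因该为老人让坐"]):
#     print(summary)
```

Keeping the formatting helper separate from the model call makes it easy to swap in a different corrector class, provided it returns results in the same assumed shape.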

Demo

Pycorrector can be explored through multiple demo platforms.

In conclusion, Pycorrector serves as a robust tool for Chinese text correction, offering a variety of cutting-edge models tailored to different error types, making it a valuable resource for developers working on text processing and correction tasks.