Introduction to WeTextProcessing
Overview
WeTextProcessing is a text processing toolkit built for production use. It handles text normalization (TN) and inverse text normalization (ITN) for English and Chinese, and is aimed at developers working on speech recognition, speech synthesis, and related natural language processing tasks.
What is Text Normalization?
Text Normalization (TN) converts written-form text into the form in which it is spoken, so that every token can be read aloud without ambiguity. This is the step a speech synthesis front end needs before it can pronounce numbers, dates, units, and similar tokens. For example, the written number "2.5" is normalized to "二点五" in Chinese. WeTextProcessing provides rules for this conversion, including rules for language-specific conventions.
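To make the direction of TN concrete, here is a minimal sketch in plain Python. It is a toy, not how WeTextProcessing works internally: the function name `verbalize_decimals` and the regular expression are assumptions made for illustration, and only decimals with a single-digit integer part are handled.

```python
import re

# Spoken Chinese reading of each ASCII digit.
DIGIT_TO_SPOKEN = dict(zip("0123456789", "零一二三四五六七八九"))

# A single ASCII digit, a dot, then one or more digits. The lookbehind skips
# numbers such as "12.5", whose integer part is not read digit by digit.
DECIMAL = re.compile(r"(?<![0-9])([0-9])\.([0-9]+)")


def verbalize_decimals(text: str) -> str:
    """Toy TN: rewrite written decimals like '2.5' as spoken '二点五'."""

    def read(match: re.Match) -> str:
        integer = DIGIT_TO_SPOKEN[match.group(1)]
        fraction = "".join(DIGIT_TO_SPOKEN[d] for d in match.group(2))
        return f"{integer}点{fraction}"

    return DECIMAL.sub(read, text)


print(verbalize_decimals("2.5平方电线"))
```

Anything the pattern does not cover, such as "12.5", is left unchanged. Covering those cases correctly is the job of the toolkit's own rules.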
What is Inverse Text Normalization?
Inverse Text Normalization (ITN) is the reverse process: it converts spoken-form text, such as the transcript produced by a speech recognizer, into its conventional written form. This makes recognition output easier to read and easier to process downstream. For instance, the spoken form "二点五" is rewritten as "2.5".
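The spoken-to-written direction can be sketched the same way. This is again a toy in plain Python, not the toolkit's implementation: `denormalize_decimals` is a hypothetical name, and the pattern only recognizes a single numeral, "点", and a run of numerals.

```python
import re

NUMERALS = "零一二三四五六七八九"
SPOKEN_TO_DIGIT = dict(zip(NUMERALS, "0123456789"))

# One numeral, "点", then one or more numerals. The lookbehind skips larger
# numbers such as "十二点五", whose integer part needs real number parsing.
SPOKEN_DECIMAL = re.compile(
    rf"(?<![{NUMERALS}十百千万亿])([{NUMERALS}])点([{NUMERALS}]+)"
)


def denormalize_decimals(text: str) -> str:
    """Toy ITN: rewrite spoken decimals like '二点五' as written '2.5'."""

    def write(match: re.Match) -> str:
        integer = SPOKEN_TO_DIGIT[match.group(1)]
        fraction = "".join(SPOKEN_TO_DIGIT[c] for c in match.group(2))
        return f"{integer}.{fraction}"

    return SPOKEN_DECIMAL.sub(write, text)


print(denormalize_decimals("二点五平方电线"))
```

The sketch also shows why ITN is hard to do with simple patterns: "点" is used in clock times as well, so a phrase like "三点五十" would be misread as a decimal here. Resolving that kind of ambiguity requires context-aware rules.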
Getting Started
The toolkit is distributed as a Python package and can be installed with pip:
pip install WeTextProcessing
Once installed, the toolkit can be used from the command line or from Python. The following Python example normalizes a Chinese string and then applies inverse normalization to the result:
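# Chinese TN (written form to spoken form) and ITN (spoken form to written form)
from tn.chinese.normalizer import Normalizer as ZhNormalizer
from itn.chinese.inverse_normalizer import InverseNormalizer
# overwrite_cache=True rebuilds the cached rules instead of reusing them
zh_tn_model = ZhNormalizer(overwrite_cache=True)
zh_itn_model = InverseNormalizer(overwrite_cache=True)
text = "2.5平方电线"
# Written form to spoken form
normalized_text = zh_tn_model.normalize(text)
# Spoken form back to written form
inverse_text = zh_itn_model.normalize(normalized_text)
print("Normalized Text:", normalized_text)
print("Inverse Normalized Text:", inverse_text)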
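The code above normalizes a written string into its spoken form and then applies inverse normalization to bring it back to a written form. The round trip is not guaranteed to reproduce the input exactly, because several written forms can share one spoken form.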
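When evaluating rules, it can help to run this round trip over several inputs and flag the ones that do not come back unchanged. The helper below is a sketch under stated assumptions: `round_trip` and `RoundTrip` are hypothetical names, not part of WeTextProcessing, and `fake_tn` and `fake_itn` are deliberately lossy stand-ins so that the block runs without the package installed.

```python
from typing import Callable, Iterable, List, NamedTuple


class RoundTrip(NamedTuple):
    written: str    # original input
    spoken: str     # output of the TN step
    restored: str   # output of the ITN step
    lossless: bool  # True when restored equals the original input


def round_trip(
    texts: Iterable[str],
    normalize: Callable[[str], str],
    inverse_normalize: Callable[[str], str],
) -> List[RoundTrip]:
    """Apply TN and then ITN to each text and record whether it survived."""
    results = []
    for written in texts:
        spoken = normalize(written)
        restored = inverse_normalize(spoken)
        results.append(RoundTrip(written, spoken, restored, restored == written))
    return results


# Stand-ins for illustration only; they are not the toolkit's behavior.
def fake_tn(text: str) -> str:
    return text.replace("2.50", "二点五").replace("2.5", "二点五")


def fake_itn(text: str) -> str:
    return text.replace("二点五", "2.5")


for row in round_trip(["2.5平方电线", "2.50平方电线"], fake_tn, fake_itn):
    print(row)
```

With the toolkit installed, the same helper should accept `zh_tn_model.normalize` and `zh_itn_model.normalize` in place of the stand-ins, since both are called with a single string in the example above.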
Advanced Capabilities
WeTextProcessing also offers advanced capabilities for users who require custom rules. This feature is particularly useful for tailoring the toolkit to specific languages or domain-specific vocabulary. Users can modify existing normalization rules and deploy changes using either Python or a C++ runtime environment.
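As an illustration of what a domain-specific rule does, the sketch below applies a small lookup table before any general normalization. It is plain Python and purely conceptual: the rules in the repository are written differently, and `apply_whitelist` and the example table are hypothetical.

```python
import re
from typing import Dict


def apply_whitelist(text: str, whitelist: Dict[str, str]) -> str:
    """Replace each whitelisted term in one pass, preferring longer terms."""
    if not whitelist:
        return text
    # Longest terms first, so "kWh" is tried before its prefix "kW".
    terms = sorted(whitelist, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda match: whitelist[match.group(0)], text)


# Hypothetical domain vocabulary for an energy application.
units = {"kW": "千瓦", "kWh": "千瓦时"}
print(apply_whitelist("功率5kW，耗电10kWh", units))
```

Doing the replacement in a single pass means a replaced term is never rewritten a second time, which keeps the table order-independent apart from the longest-match preference.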
To modify and deploy custom rules, users can clone the WeTextProcessing repository and adjust the rules as needed:
git clone https://github.com/wenet-e2e/WeTextProcessing.git
cd WeTextProcessing
pip install -r requirements.txt
After changing the rules, rebuild them by running the module with the --overwrite_cache flag:
python -m tn --text "Your text here" --overwrite_cache
Users can then deploy the modified processing pipelines in their applications, ensuring that text normalization and inverse text normalization meet their precise requirements.
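The rebuild step can also be scripted, for example in a test or build job. The sketch below wraps the command shown above with `subprocess`. The helper names are hypothetical, and it assumes that the `itn` module accepts the same `--text` and `--overwrite_cache` options as `tn`, which this document does not state.

```python
import subprocess
import sys
from typing import List


def rebuild_command(module: str, text: str) -> List[str]:
    """Build the argument list for `python -m <module> --text ... --overwrite_cache`."""
    if module not in ("tn", "itn"):
        raise ValueError(f"expected 'tn' or 'itn', got {module!r}")
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", module, "--text", text, "--overwrite_cache"]


def rebuild(module: str, text: str) -> str:
    """Run the rebuild from the current directory and return its standard output."""
    completed = subprocess.run(
        rebuild_command(module, text),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return completed.stdout


print(rebuild_command("tn", "2.5平方电线")[1:])
```

`rebuild` is meant to be called from the root of the cloned repository, after the requirements have been installed, so that `python -m tn` resolves to the modified sources.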
Further Reading and Community Support
For those interested in the technical pipeline, detailed documentation is provided within the repository. Community support is available via GitHub Issues, and a dedicated WeChat group facilitates discussion and quick responses.
Acknowledgments
WeTextProcessing is built on a solid foundation of open-source projects and collective expertise. Special thanks are due to the creators of OpenFst, Pynini, the NeMo community, and various contributors who have made advancements in text processing possible.
WeTextProcessing brings these text processing capabilities to developers and researchers working in speech and language technology.