NLP Chinese Data Augmentation (nlpcda)
NLP Chinese Data Augmentation (nlpcda) is a tool for augmenting Chinese-language datasets with a single call. It supports several augmentation strategies that expand the training data while aiming to preserve the meaning of the original text, which can improve the performance of NLP models. Install it from the Python Package Index (PyPI) by running: pip install nlpcda
Key Features
The nlpcda tool facilitates the following data augmentation methods:
- Random Entity Replacement: Replaces entities in the text with equivalent entities from a predefined list to create diverse training samples.
- Synonym Replacement: Replaces words with their synonyms to produce alternative phrasings and increase the diversity of the training set.
- Homophone Replacement: Swaps characters for similar-sounding characters, adding variability while keeping the text readable.
- Random Character Deletion: Randomly deletes characters while preserving essential content such as numbers and dates, which helps models become robust to text perturbation.
- NER Data Augmentation: Augments Named Entity Recognition (NER) training data that uses BIO tagging, so that labeled datasets can be expanded as well.
- Character Swapping: Swaps the positions of nearby characters, relying on the fact that human readers can usually still understand text whose character order is slightly scrambled.
- Equivalent Character Replacement: Replaces characters with equivalent forms, such as digits with Chinese numerals, to create varied training examples.
- Translation-Based Augmentation: Translates the text into another language and back, changing its structure while retaining its meaning.
- SimBERT Sentence Generation: Generates similar sentences with the SimBERT model, producing additional samples with similar meaning.
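As a rough illustration of the NER augmentation idea in the list above, the sketch below swaps each entity span in a BIO-tagged sentence for another entity of the same type and regenerates the tags so they stay aligned with the characters. It does not use nlpcda and is not its implementation; the function names `extract_spans` and `augment_bio`, the `entity_pool` dictionary, and the sample sentence are all assumptions made for illustration.

```python
import random


def extract_spans(chars, tags):
    """Group a BIO-tagged character sequence into (text, entity_type) segments.

    Characters outside any entity become single-character segments with type None.
    """
    segments = []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity segment.
            segments.append([ch, tag[2:]])
        elif tag.startswith("I-") and segments and segments[-1][1] == tag[2:]:
            # An I- tag continues the entity opened just before it.
            segments[-1][0] += ch
        else:
            # "O" tags (and stray I- tags) are treated as non-entity characters.
            segments.append([ch, None])
    return [(text, etype) for text, etype in segments]


def augment_bio(chars, tags, entity_pool, rng):
    """Return a new (chars, tags) pair with entities swapped for same-type entities.

    entity_pool is a hypothetical mapping such as {"LOC": ["上海", "哈尔滨"]}.
    """
    new_chars, new_tags = [], []
    for text, etype in extract_spans(chars, tags):
        if etype is None:
            new_chars.extend(text)
            new_tags.extend(["O"] * len(text))
            continue
        candidates = entity_pool.get(etype)
        if candidates:
            text = rng.choice(candidates)
        # Re-tag the (possibly longer or shorter) entity from scratch.
        new_chars.extend(text)
        new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(text) - 1))
    return new_chars, new_tags


# Illustrative data, not taken from the nlpcda project.
chars = list("我在北京工作")
tags = ["O", "O", "B-LOC", "I-LOC", "O", "O"]
pool = {"LOC": ["上海", "哈尔滨"]}
aug_chars, aug_tags = augment_bio(chars, tags, pool, random.Random(0))
for ch, tag in zip(aug_chars, aug_tags):
    print(ch, tag)
```

Regenerating the tags from the replacement text, rather than copying the old ones, is what keeps the labels valid when the new entity has a different length.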
Work in Progress
- Speech-Based Text Transformation: This future feature aims to transform text to speech and back, using models like fastspeech2 and wav2vec2 for high-fidelity text conversion.
- Number Conversion Tool: A feature to convert digits in text into pure Chinese, aiding in both textual and speech synthesis tasks.
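Since the number conversion tool is still planned, the following is only a sketch of what converting digits to Chinese could look like, not nlpcda code. It assumes two possible readings: digit by digit (suitable for years or IDs) and positional for integers from 0 to 9999. The names `digits_to_chinese`, `int_to_chinese`, and `convert_numbers`, and the choice of 二 rather than 两, are assumptions made for illustration.

```python
import re

NUMERALS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def digits_to_chinese(digits):
    """Read a digit string one digit at a time, e.g. '2023' -> '二零二三'."""
    return "".join(NUMERALS[int(d)] for d in digits)


def int_to_chinese(n):
    """Read an integer in 0..9999 positionally, e.g. 110 -> '一百一十'."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return "零"
    digits = str(n)
    parts = []
    pending_zero = False
    for i, d in enumerate(digits):
        position = len(digits) - 1 - i
        if d == "0":
            # Remember the gap; a run of zeros is read as a single 零.
            pending_zero = True
            continue
        if pending_zero and parts:
            parts.append("零")
        pending_zero = False
        parts.append(NUMERALS[int(d)] + UNITS[position])
    result = "".join(parts)
    # 10..19 are conventionally read 十, 十一, ... rather than 一十, 一十一, ...
    if result.startswith("一十"):
        result = result[1:]
    return result


def convert_numbers(text, positional=False):
    """Replace every run of ASCII digits in text with Chinese numerals."""
    def replace(match):
        digits = match.group()
        has_leading_zero = len(digits) > 1 and digits[0] == "0"
        if positional and len(digits) <= 4 and not has_leading_zero:
            return int_to_chinese(int(digits))
        # Fall back to digit-by-digit for long numbers and codes like 007.
        return digits_to_chinese(digits)
    return re.sub(r"[0-9]+", replace, text)


print(convert_numbers("2023年5月"))                 # 二零二三年五月
print(convert_numbers("共110人", positional=True))  # 共一百一十人
```

Which reading is appropriate depends on context (a year versus a quantity), so a real tool would likely need rules or a model to choose between them.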
Benefits
The primary goal of nlpcda is to generate large volumes of training data that preserve the original meaning. This helps NLP models generalize better, resist adversarial attacks, and handle perturbed input. According to the project, the tool has also been used in entries that placed highly in several NLP competitions.
API Examples
Below are examples of how to use the nlpcda tool's API for different augmentation techniques:
Random Entity Replacement Example
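from nlpcda import Randomword

test_str = '''...'''
# create_num: maximum number of sentences to return
# change_rate: fraction of the text to alter
smw = Randomword(create_num=3, change_rate=0.3)
rs1 = smw.replace(test_str)  # sequence of augmented strings
for s in rs1:
    print(s)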
Synonym Replacement Example
from nlpcda import Similarword

test_str = '''...'''
# Same parameters as above: sentence count and change rate
smw = Similarword(create_num=3, change_rate=0.3)
rs1 = smw.replace(test_str)  # sequence of augmented strings
for s in rs1:
    print(s)
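To make the roles of create_num and change_rate more concrete, here is a small standalone sketch of dictionary-based synonym replacement. It is not how nlpcda is implemented: it assumes the text is already split into words, uses a tiny hypothetical synonym table, and treats change_rate as a per-word replacement probability and create_num as the maximum number of distinct sentences returned, with the original sentence first. The `max_tries` parameter is likewise an assumption.

```python
import random


def synonym_variants(tokens, synonyms, create_num=3, change_rate=0.3,
                     seed=0, max_tries=50):
    """Return up to create_num distinct sentences, starting with the original.

    tokens:   pre-segmented words (a real pipeline would need a Chinese
              word segmenter; none is assumed here)
    synonyms: hypothetical mapping from a word to a list of replacements
    """
    rng = random.Random(seed)
    results = ["".join(tokens)]
    for _ in range(max_tries):
        if len(results) >= create_num:
            break
        new_tokens = [
            rng.choice(synonyms[tok])
            if tok in synonyms and rng.random() < change_rate
            else tok
            for tok in tokens
        ]
        candidate = "".join(new_tokens)
        # Keep only sentences that have not been produced yet.
        if candidate not in results:
            results.append(candidate)
    return results


# Illustrative data, not taken from the nlpcda project.
tokens = ["今天", "天气", "很", "好"]
synonyms = {"今天": ["今日"], "好": ["不错", "棒"]}
for sentence in synonym_variants(tokens, synonyms, create_num=3, change_rate=0.5):
    print(sentence)
```

Because replacements are sampled at random, fewer than create_num sentences may come back when the text contains few replaceable words.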
Random Deletion Example
from nlpcda import RandomDeleteChar

test_str = '''...'''
# Same parameters as above: sentence count and change rate
smw = RandomDeleteChar(create_num=3, change_rate=0.3)
rs1 = smw.replace(test_str)  # sequence of augmented strings
for s in rs1:
    print(s)
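The idea behind random deletion that spares numbers and dates can be sketched without the library. The example below is an assumption-laden illustration, not nlpcda's implementation: the `PROTECTED` pattern (runs of ASCII digits, optionally followed by 年, 月, or 日) and the function name `random_delete_chars` are hypothetical, and change_rate is treated as a per-character deletion probability.

```python
import random
import re

# Hypothetical notion of "essential content": digit runs plus a date unit.
PROTECTED = re.compile(r"[0-9]+[年月日]?")


def random_delete_chars(text, change_rate=0.3, seed=None):
    """Drop each unprotected character with probability change_rate."""
    rng = random.Random(seed)
    protected = set()
    for match in PROTECTED.finditer(text):
        protected.update(range(match.start(), match.end()))
    return "".join(
        ch
        for i, ch in enumerate(text)
        # Protected positions are always kept; others survive at random.
        if i in protected or rng.random() >= change_rate
    )


# Illustrative sentence, not taken from the nlpcda project.
text = "会议定于2023年5月1日在北京举行"
for seed in range(3):
    print(random_delete_chars(text, change_rate=0.3, seed=seed))
```

With change_rate=1.0 only the protected spans remain, which is a quick way to check what a given pattern considers essential.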
These snippets show how the nlpcda library can be used to augment text data, making it a versatile tool for NLP practitioners working with Chinese-language datasets. The project is still being developed, and further features are planned.