Chinese Speech Pretrain
Overview
The Chinese_speech_pretrain project leverages a large corpus of Mandarin speech for unsupervised pretraining. Specifically, it uses 10,000 hours of Chinese audio from the WenetSpeech train_l set. This data is collected primarily from YouTube and podcasts, capturing a wide range of recording conditions, background noises, and speaking styles across ten major domains: audiobooks, commentary, documentaries, TV series, interviews, news, readings, speeches, variety shows, and others.
For this project, two model families were trained with the Fairseq toolkit: wav2vec 2.0 and HuBERT. Following the configurations from the original papers, each pretrained model is available in both BASE and LARGE sizes. The BASE models were trained on 8 A100 GPUs, using gradient accumulation to simulate 64 GPUs; the LARGE models were trained on 16 A100 GPUs, using gradient accumulation to simulate 128 GPUs.
Model Downloads
To facilitate access, these models are available for download on Hugging Face's model hub; quick links to the Fairseq checkpoints are given in the source text. Each model can also be downloaded from Baidu Drive, with the corresponding access codes provided in the source text.
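For scripted downloads from the Hugging Face hub, something like the following works; this is a minimal sketch, and the repository ID shown is an assumption to be checked against the actual model pages.

```python
# Minimal sketch: fetch a pretrained checkpoint from the Hugging Face hub.
# The repo ID below is an assumption -- substitute the ID from the model page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="TencentGameMate/chinese-hubert-base")
print(f"Checkpoint files downloaded to: {local_dir}")
```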
Downstream Task: Chinese Speech Recognition
To evaluate the effectiveness of the pretrained models on the downstream Automatic Speech Recognition (ASR) task, experiments were conducted following the Conformer model configuration from the ESPnet toolkit. The pretrained models act as frozen feature extractors: input speech is passed through the model, a weighted sum of the hidden-layer representations is computed, and this weighted sum replaces traditional FBank features as input to the Conformer ASR model.
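The weighted-sum idea can be sketched in a few lines of PyTorch: one learnable scalar per layer is softmax-normalized and used to combine the hidden states. This is a minimal sketch, not the project's ESPnet code, and the model ID is an assumption.

```python
import torch
import torch.nn as nn
from transformers import HubertModel

class WeightedSumFeatures(nn.Module):
    """Combine all hidden layers of a frozen pretrained encoder into one
    feature sequence via softmax-normalized learnable layer weights."""

    def __init__(self, model_name="TencentGameMate/chinese-hubert-base"):  # assumed ID
        super().__init__()
        self.encoder = HubertModel.from_pretrained(model_name)
        self.encoder.eval()  # used as a frozen feature extractor
        num_layers = self.encoder.config.num_hidden_layers + 1  # +1 for embedding output
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, input_values):
        with torch.no_grad():
            outputs = self.encoder(input_values, output_hidden_states=True)
        stacked = torch.stack(outputs.hidden_states)      # (layers, batch, frames, dim)
        weights = torch.softmax(self.layer_weights, dim=0)
        # Weighted sum over the layer dimension replaces FBank features.
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# Usage: a 16 kHz waveform in, (batch, frames, hidden_dim) features out.
features = WeightedSumFeatures()(torch.randn(1, 16000))
```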
Aishell Dataset Experiment Results
Using the 178-hour Aishell training set for supervised training, Character Error Rates (CER) were compared across FBank features, wav2vec 2.0 features, and HuBERT features. Additional comparisons trained on the 10,000-hour WenetSpeech train_l data and evaluated on the Aishell test set. These experiments applied speed perturbation and SpecAugment for data augmentation, and decoding used beam search with Transformer language model rescoring.
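For readers unfamiliar with the two augmentations, the sketch below shows one common way to apply them with torchaudio; the parameter values are illustrative defaults, not the project's exact settings.

```python
import torch
import torchaudio

sample_rate = 16000
wav = torch.randn(1, sample_rate)  # placeholder one-second waveform

# Speed perturbation: resample at a randomly chosen speed factor (0.9/1.0/1.1).
speed = ["0.9", "1.0", "1.1"][torch.randint(0, 3, (1,)).item()]
wav_sp, _ = torchaudio.sox_effects.apply_effects_tensor(
    wav, sample_rate, [["speed", speed], ["rate", str(sample_rate)]]
)

# SpecAugment: mask random frequency bands and time spans of the spectrogram.
spec = torchaudio.transforms.MelSpectrogram(sample_rate)(wav_sp)
spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)(spec)
spec = torchaudio.transforms.TimeMasking(time_mask_param=100)(spec)
```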
WenetSpeech Experiment Results
Another set of experiments used the 100-hour WenetSpeech train_s set for supervised training, again comparing CER between FBank features and features from the wav2vec 2.0 and HuBERT models. For reference, models were also trained on the train_m (1,000 hours) and train_l (10,000 hours) sets using FBank features. Unlike the Aishell experiments, no speed perturbation or SpecAugment was applied, and decoding used beam search without language model rescoring.
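As a reference point for both sets of results, CER is the character-level edit distance between hypothesis and reference divided by the reference length; a minimal self-contained implementation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance over the reference length."""
    hyp_len = len(hypothesis)
    prev = list(range(hyp_len + 1))  # edit distances against the empty prefix
    for i, r in enumerate(reference, start=1):
        curr = [i] + [0] * hyp_len
        for j, h in enumerate(hypothesis, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[hyp_len] / max(len(reference), 1)

print(cer("今天天气很好", "今天天气真好"))  # one substitution over six characters ≈ 0.167
```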
Model Usage
For those interested in using these models, the source text provides Python scripts for loading and processing audio data through either the Fairseq or Hugging Face interfaces.
Note that the models do not include a tokenizer, as they were pretrained on audio alone; to use them for speech recognition, a tokenizer must be created and the model fine-tuned on labeled text data. The example scripts require transformers version 4.16.2.
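For orientation, a minimal Hugging Face loading sketch along these lines follows; the model ID and audio file name are assumptions, and the project's own scripts should be preferred.

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2FeatureExtractor, HubertModel

model_path = "TencentGameMate/chinese-hubert-base"  # assumed hub ID
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_path)
model = HubertModel.from_pretrained(model_path)
model.eval()

# Load a 16 kHz mono waveform; "example_zh.wav" is a hypothetical file name.
wav, sr = sf.read("example_zh.wav")
inputs = feature_extractor(wav, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    hidden = model(inputs.input_values).last_hidden_state
print(hidden.shape)  # (batch, frames, hidden_dim)
```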
Embrace and Explore
Researchers and developers are encouraged to use these pretrained models to explore applications in Chinese and related speech scenarios.
Citations and References
Projects using these models include GPT-SoVITS, and citation information is available for those wishing to reference this project in academic or technical work. The source text also lists comprehensive references supporting the methodologies employed in the project.
Through this effort, Chinese_speech_pretrain opens pathways for further exploration of speech recognition technologies across Mandarin language applications.