ASRT Speech Recognition Project
Overview
ASRT is a Chinese speech recognition system built on deep learning. Developed by GitHub user nl8590687, it combines TensorFlow, Python, and neural networks to transform audio input into text, focusing on the recognition of Mandarin Chinese speech. The system is released under the GPL-3.0 license.
Project Foundation
ASRT employs TensorFlow's Keras framework, integrating deep convolutional neural networks (DCNN), long short-term memory (LSTM) networks, and the Connectionist Temporal Classification (CTC) approach. The system also utilizes attention mechanisms to enhance speech-to-text capabilities, ensuring an efficient and accurate recognition process.
System Requirements
To run the training models, certain hardware and software prerequisites are necessary:
Hardware:
- CPU: Minimum of 4 cores (x86_64, amd64)
- RAM: At least 16 GB
- GPU: NVIDIA GPU with at least 11 GB of video memory (GTX 1080 Ti or newer)
- Storage: 500 GB HDD or SSD
Software:
- OS: Linux (Ubuntu 20.04+ / CentOS 7+) for both training and inference, or Windows 10/11 for inference only
- Python: Version 3.9 and above
- TensorFlow: Version 2.5 and above
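Before installing, the interpreter and framework versions can be checked with a short snippet (a convenience sketch, not part of the ASRT codebase):

```python
import sys

def meets_minimum(version, minimum):
    """Return True if a (major, minor) version tuple meets the minimum."""
    return version >= minimum

# Python 3.9+ is required.
print("Python OK:", meets_minimum(sys.version_info[:2], (3, 9)))

# TensorFlow 2.5+ is required; guard the import since it may not be installed.
try:
    import tensorflow as tf
    major, minor = (int(x) for x in tf.__version__.split(".")[:2])
    print("TensorFlow OK:", meets_minimum((major, minor), (2, 5)))
except ImportError:
    print("TensorFlow is not installed.")
```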
Quick Start Guide
Setup is straightforward: clone the ASRT repository with Git, download the necessary datasets, and unpack them into a designated storage directory.
$ git clone https://github.com/nl8590687/ASRT_SpeechRecognition.git
$ mkdir /data/speech_data
$ tar zxf <dataset_filename> -C /data/speech_data/
After preparing the datasets, download the default phonetic label files with the provided Python script:
$ python download_default_datalist.py
Users can then begin model training using:
$ python3 train_speech_model.py
For testing the model:
$ python3 evaluate_speech_model.py
Deploying ASRT
ASRT can also be deployed with Docker, which runs the recognition service in a container and exposes RESTful and gRPC APIs for easy integration with other applications.
$ docker pull ailemondocker/asrt_service:1.3.0
$ docker run --rm -it -p 20001:20001 -p 20002:20002 --name asrt-server -d ailemondocker/asrt_service:1.3.0
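Once the container is running, the RESTful endpoint on port 20001 can be called from Python. The URL path and payload fields below follow the pattern used by ASRT's HTTP API, but treat them as assumptions and verify against the service documentation for your version:

```python
import base64
import json
import urllib.request

def build_payload(pcm_bytes, sample_rate=16000, channels=1, byte_width=2):
    """Package raw 16-bit mono PCM samples as a JSON request body.
    Field names are assumed from ASRT's HTTP API; verify for your version."""
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "byte_width": byte_width,
        "samples": base64.b64encode(pcm_bytes).decode("ascii"),
    }

def recognize(pcm_bytes, url="http://127.0.0.1:20001/all"):
    """POST audio to the ASRT service and return the parsed JSON response."""
    body = json.dumps(build_payload(pcm_bytes)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call like `recognize(wav_samples)` then returns the service's JSON result for the supplied audio.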
Model and Languages
ASRT's acoustic model uses a DCNN + CTC architecture and processes audio clips up to 16 seconds long. The language model is based on a maximum entropy hidden Markov model, converting the recognized pinyin sequence into Mandarin text; pinyin recognition accuracy reaches up to 85% on test data.
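The pinyin-to-character step can be pictured as a Viterbi search over a probabilistic graph: each pinyin syllable maps to several candidate characters, and transition scores between consecutive characters select the most likely sentence. The toy lexicon and probabilities below are invented for illustration and bear no relation to ASRT's actual model files:

```python
# Toy Viterbi decoding of a pinyin sequence into Chinese characters.
# Lexicon and probabilities are invented for illustration only.

# Candidate characters for each pinyin syllable, with emission scores.
emissions = {
    "ni3":  {"你": 0.9, "拟": 0.1},
    "hao3": {"好": 0.8, "号": 0.2},
}

# Transition scores between consecutive characters (bigram-style).
transitions = {("你", "好"): 0.9, ("你", "号"): 0.1,
               ("拟", "好"): 0.5, ("拟", "号"): 0.5}

def viterbi(pinyin_seq):
    """Return the highest-scoring character string for the pinyin input."""
    # paths maps the last character to (score, character sequence so far).
    paths = {ch: (p, [ch]) for ch, p in emissions[pinyin_seq[0]].items()}
    for syllable in pinyin_seq[1:]:
        new_paths = {}
        for ch, emit_p in emissions[syllable].items():
            # Extend the best previous path by this candidate character.
            score, seq = max(
                (prev_score * transitions.get((prev_ch, ch), 0.01) * emit_p,
                 prev_seq + [ch])
                for prev_ch, (prev_score, prev_seq) in paths.items()
            )
            new_paths[ch] = (score, seq)
        paths = new_paths
    best_score, best_seq = max(paths.values())
    return "".join(best_seq)

print(viterbi(["ni3", "hao3"]))  # 你好
```

The real language model works the same way at scale, scoring character sequences so that ambiguous pinyin resolves to the most probable Mandarin text.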
Community and Support
For ongoing queries or contributions, users are encouraged to explore ASRT's online documentation, FAQs, and community discussions via GitHub Issues. The creator, AILemon, provides additional support channels, including a dedicated QQ group and a WeChat contact.
Conclusion
ASRT represents a significant advancement in Chinese speech recognition technology, supporting a community-driven approach through its open-source model. With a wealth of resources and a robust support system, it stands as a versatile tool for developers interested in enhancing voice-interaction capabilities in their applications.