ASRT Speech Recognition Project
Overview
ASRT is a Chinese speech recognition system built on deep learning. Developed by GitHub user nl8590687, it combines TensorFlow, Python, and neural networks to transform audio input into text, focusing on the recognition of Mandarin Chinese speech. The system is released under the GPL-3.0 license.
Project Foundation
ASRT employs TensorFlow's Keras framework, integrating deep convolutional neural networks (DCNN), long short-term memory (LSTM) networks, and the Connectionist Temporal Classification (CTC) approach. The system also utilizes attention mechanisms to enhance speech-to-text capabilities, ensuring an efficient and accurate recognition process.
System Requirements
To run the training models, certain hardware and software prerequisites are necessary:
Hardware:
- CPU: Minimum of 4 cores (x86_64, amd64)
- RAM: At least 16 GB
- GPU: NVIDIA GPU with at least 11 GB of video memory (GTX 1080 Ti or newer)
- Storage: 500 GB HDD or SSD
Software:
- OS: Linux (Ubuntu 20.04+ / CentOS 7+) for both training and inference, or Windows 10/11 for inference only
- Python: Version 3.9 and above
- TensorFlow: Version 2.5 and above
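Before installing, the interpreter and framework versions can be checked with a short snippet (a convenience sketch, not part of the ASRT codebase):

```python
import sys

def meets_minimum(version, minimum):
    """Return True if a (major, minor) version tuple meets the minimum."""
    return version >= minimum

# Python 3.9+ is required.
print("Python OK:", meets_minimum(sys.version_info[:2], (3, 9)))

# TensorFlow 2.5+ is required; guard the import since it may not be installed.
try:
    import tensorflow as tf
    major, minor = (int(x) for x in tf.__version__.split(".")[:2])
    print("TensorFlow OK:", meets_minimum((major, minor), (2, 5)))
except ImportError:
    print("TensorFlow is not installed.")
```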
Quick Start Guide
Setup is straightforward: clone the ASRT repository with Git, download the necessary datasets, and unpack them into a designated storage directory.
$ git clone https://github.com/nl8590687/ASRT_SpeechRecognition.git
$ mkdir /data/speech_data
$ tar zxf <dataset_filename> -C /data/speech_data/
After preparing the datasets, download the default phonetic label files with the provided Python script:
$ python download_default_datalist.py
Users can then begin model training using:
$ python3 train_speech_model.py
For testing the model:
$ python3 evaluate_speech_model.py
Deploying ASRT
ASRT can also be deployed with Docker, which runs the recognition service in a container and exposes RESTful and gRPC APIs for easy integration with other applications.
$ docker pull ailemondocker/asrt_service:1.3.0
$ docker run --rm -it -p 20001:20001 -p 20002:20002 --name asrt-server -d ailemondocker/asrt_service:1.3.0
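Once the container is running, the RESTful endpoint on port 20001 can be called from Python. The URL path and payload fields below follow the pattern used by ASRT's HTTP API, but treat them as assumptions and verify against the service documentation for your version:

```python
import base64
import json
import urllib.request

def build_payload(pcm_bytes, sample_rate=16000, channels=1, byte_width=2):
    """Package raw 16-bit mono PCM samples as a JSON request body.
    Field names are assumed from ASRT's HTTP API; verify for your version."""
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "byte_width": byte_width,
        "samples": base64.b64encode(pcm_bytes).decode("ascii"),
    }

def recognize(pcm_bytes, url="http://127.0.0.1:20001/all"):
    """POST audio to the ASRT service and return the parsed JSON response."""
    body = json.dumps(build_payload(pcm_bytes)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call like `recognize(wav_samples)` then returns the service's JSON result for the supplied audio.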
Model and Languages
ASRT's acoustic model uses a DCNN + CTC architecture and processes audio clips up to 16 seconds long. The language model is based on a maximum entropy hidden Markov model, converting the recognized pinyin sequence into Mandarin text; pinyin recognition accuracy reaches up to 85% on test data.
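The pinyin-to-character step can be pictured as a Viterbi search over a probabilistic graph: each pinyin syllable maps to several candidate characters, and transition scores between consecutive characters select the most likely sentence. The toy lexicon and probabilities below are invented for illustration and bear no relation to ASRT's actual model files:

```python
# Toy Viterbi decoding of a pinyin sequence into Chinese characters.
# Lexicon and probabilities are invented for illustration only.

# Candidate characters for each pinyin syllable, with emission scores.
emissions = {
    "ni3":  {"你": 0.9, "拟": 0.1},
    "hao3": {"好": 0.8, "号": 0.2},
}

# Transition scores between consecutive characters (bigram-style).
transitions = {("你", "好"): 0.9, ("你", "号"): 0.1,
               ("拟", "好"): 0.5, ("拟", "号"): 0.5}

def viterbi(pinyin_seq):
    """Return the highest-scoring character string for the pinyin input."""
    # paths maps the last character to (score, character sequence so far).
    paths = {ch: (p, [ch]) for ch, p in emissions[pinyin_seq[0]].items()}
    for syllable in pinyin_seq[1:]:
        new_paths = {}
        for ch, emit_p in emissions[syllable].items():
            # Extend the best previous path by this candidate character.
            score, seq = max(
                (prev_score * transitions.get((prev_ch, ch), 0.01) * emit_p,
                 prev_seq + [ch])
                for prev_ch, (prev_score, prev_seq) in paths.items()
            )
            new_paths[ch] = (score, seq)
        paths = new_paths
    best_score, best_seq = max(paths.values())
    return "".join(best_seq)

print(viterbi(["ni3", "hao3"]))  # 你好
```

The real language model works the same way at scale, scoring character sequences so that ambiguous pinyin resolves to the most probable Mandarin text.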
Community and Support
For ongoing queries or contributions, users are encouraged to explore ASRT's online documentation, FAQs, and community discussions via GitHub Issues. The creator, AILemon, provides additional support channels, including a dedicated QQ group and a WeChat contact.
Conclusion
ASRT represents a significant advancement in Chinese speech recognition technology, supporting a community-driven approach through its open-source model. With a wealth of resources and a robust support system, it stands as a versatile tool for developers interested in enhancing voice-interaction capabilities in their applications.