Introduction to TensorflowASR
TensorflowASR is a project for building end-to-end speech recognition models with TensorFlow 2, built around the Conformer architecture. It aims to provide efficient and accurate speech-to-text solutions, with a real-time factor (RTF) close to 0.1 when executed on a CPU. The current version, V2, uses a CTC+translate structure, an improvement over its predecessor, and the project welcomes bug reports and community feedback.
Project Comparison
The project has been tested on the Aishell-1 dataset, showing promising results both offline and online:
- Offline: the ConformerCTC model achieves a Chinese Character Error Rate (CER) of 6.8% with a parameter count of 10.1M.
- Streaming: the StreamingConformerCTC model reaches a CER of 7.2%, also with 10.1M parameters.
Features and Functionalities
TensorflowASR supports a range of functionalities designed to enhance speech recognition:
- Voice Activity Detection (VAD) and noise reduction.
- Punctuation Restoration for improved text readability.
- Text-to-Speech (TTS) Data Augmentation assisting in synthesizing speech data when raw data is unavailable.
- Online and Offline Recognition capabilities.
- Audio enhancement for varying distances and sound styles.
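As a rough illustration of the VAD idea, a minimal energy-based detector can be sketched in a few lines of NumPy. This is not TensorflowASR's actual VAD implementation, only the underlying principle; the frame sizes and threshold below are made-up defaults:

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Flag frames whose log energy exceeds a threshold relative to the peak.

    A minimal energy-based VAD for illustration only; a production VAD
    (such as the one in TensorflowASR) is considerably more robust.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(f.astype(np.float64) ** 2) + 1e-12 for f in frames])
    log_e = 10.0 * np.log10(energy / energy.max())
    return log_e > threshold_db  # True = speech-like frame

# Usage: 1 s of silence followed by 1 s of a loud 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = energy_vad(audio)
```

The silent half is rejected and the tone half accepted, with only the boundary frames ambiguous.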
TTS Data Augmentation System
Even with limited data, TensorflowASR can reach satisfactory Automatic Speech Recognition (ASR) performance by employing TTS augmentation. The TTS module, trained on the Aishell-1 and Aishell-3 datasets, supports multiple voice styles and is particularly useful for Chinese.
Steps to use the TTS Data Augmentation:
- Prepare a text list for synthesis.
- Download necessary models from the provided link.
- Execute a script to generate synthetic audio data.
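The three steps above can be sketched as a small pipeline. The `synthesize` stub below is a hypothetical placeholder standing in for the downloaded TTS checkpoints (it emits silent audio of a text-dependent length); the rest is plain file bookkeeping:

```python
import struct
import wave
from pathlib import Path

def synthesize(text, sr=16000):
    """Hypothetical stand-in for the project's TTS model. A real run would
    invoke the downloaded checkpoints; here we emit placeholder samples."""
    n = sr // 10 * max(1, len(text) // 5)
    return [0] * n, sr

def augment(text_list_path, out_dir):
    """Read a text list, synthesize one wav per line, write a manifest."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = Path(text_list_path).read_text(encoding="utf-8").splitlines()
    manifest = []
    for i, text in enumerate(t for t in lines if t.strip()):
        samples, sr = synthesize(text)
        wav_path = out / f"synth_{i:05d}.wav"
        with wave.open(str(wav_path), "wb") as w:
            w.setnchannels(1)          # mono
            w.setsampwidth(2)          # 16-bit PCM
            w.setframerate(sr)
            w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
        manifest.append(f"{wav_path}\t{text}")
    (out / "manifest.tsv").write_text("\n".join(manifest), encoding="utf-8")
    return manifest
```

The resulting manifest (audio path plus transcript per line) is the shape of data an ASR trainer typically consumes.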
Mel Layer and Alternative Options
The project includes a Mel Layer that follows the librosa library's conventions, improving speech spectral feature extraction. Alternatively, users may opt for the Leaf frontend for a smaller parameter footprint.
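To illustrate what such a Mel layer computes, here is a minimal NumPy sketch of a triangular mel filterbank using the HTK-style mel formula (librosa supports this variant via its `htk` flag; its default is the Slaney scale, and the parameter values below are illustrative, not the project's settings):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=80, fmin=0.0, fmax=8000.0):
    """Triangular mel filters evaluated at the FFT bin center frequencies."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb
```

Multiplying this matrix with a power spectrogram yields the mel spectrogram that the Mel layer's log-compressed features are built on.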
C++ and Python Inference
TensorflowASR extends its reach with C++ and Python inference solutions, using ONNX technology. This allows for efficient deployment and testing across platforms.
Streaming Conformer Architectures
TensorflowASR supports streaming recognition with Conformer architectures:
- Block Conformer + Global CTC: Suitable for short-term recognition scenarios with context-building capabilities.
- Chunk Conformer + CTC Picker: Ideal for long-term streaming recognition, inspired by Baidu's SMLTA2 approach.
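A chunk-wise encoder like those above consumes features in fixed windows plus some left context. The following sketch only illustrates that slicing pattern; the chunk and context sizes are invented defaults, not the project's actual configuration:

```python
import numpy as np

def iter_chunks(features, chunk=16, left_context=8):
    """Yield (left_context + chunk) windows over a [T, D] feature matrix,
    mimicking how a chunk-wise streaming encoder consumes its input.
    Illustrative only; real models also cache encoder states across chunks."""
    T = features.shape[0]
    for start in range(0, T, chunk):
        ctx_start = max(0, start - left_context)
        yield features[ctx_start:start + chunk], start

# Usage: 50 frames of 80-dim features split into 4 chunked windows.
feats = np.random.randn(50, 80).astype(np.float32)
windows = list(iter_chunks(feats))
```

Each window after the first carries its left context, which is how chunked attention sees a little history without waiting for the full utterance.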
Pretrained Models
Models are evaluated on the AISHELL TEST dataset. The real-time capabilities are benchmarked on a single CPU core, highlighting impressive RTF performance.
For example:
- ConformerCTC: a CER of 6.4% with an RTF of 0.056.
- StreamingConformerCTC: a CER of 7.2% with an RTF of 0.08.
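The RTF figures follow the usual definition, processing time divided by audio duration, so values below 1.0 mean faster than real time:

```python
def real_time_factor(process_seconds, audio_seconds):
    """RTF = processing time / audio duration; < 1.0 is faster than real time."""
    return process_seconds / audio_seconds

# E.g. decoding 10 s of audio in 0.56 s corresponds to the reported 0.056 RTF.
rtf = real_time_factor(0.56, 10.0)
```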
Supported Structures and Models
TensorflowASR supports various sophisticated models and architectures:
- CTC and Streaming structures.
- Conformer models like Conformer, Block Conformer, and Chunk Conformer.
Technical Requirements and Setup
Users need Python (3.6+), TensorFlow (2.8+), and a number of supporting libraries, including 'librosa', 'keras-bert', and 'tf2onnx'. Detailed instructions guide users through configuration and model training.
Community Engagement
TensorflowASR encourages participation in its community, offering a space for discussion and knowledge sharing. Interested individuals can join by adding a contact with the identifier "TensorflowASR."
Recent Updates and Project Developments
Continuous improvements and updates are part of TensorflowASR's roadmap, with the latest including updates to the Chunk Conformer structure for enhanced long-duration ASR.
Licensing and Usage
The project is licensed under Apache 2.0, promoting free academic and commercial application, but prohibits commercial trading of the project itself.
By enabling efficient, accurate, and flexible speech recognition solutions, TensorflowASR stands out as an essential resource in the field of automatic speech recognition development.