Speech Recognition & Synthesis for Ukrainian
Introduction
The Speech Recognition & Synthesis for Ukrainian repository serves as a comprehensive resource for developers and researchers interested in building or improving Ukrainian language models for both Speech-to-Text (STT) and Text-to-Speech (TTS) applications. This repository aggregates links to relevant models, datasets, tools, and community forums.
Community Hub
To foster collaboration and exchange ideas, the project has established online communities:
- Discord: a server for real-time discussion.
- Telegram: channels for announcements and community chat.
Speech-to-Text
The STT section outlines implementations, benchmarks, and development resources for converting spoken Ukrainian into text:
Implementations
- wav2vec2 and wav2vec2-bert: These models, available in different parameter sizes, are designed for high accuracy speech recognition. Notably, some versions are supplemented with language models trained on news or Wikipedia text to boost performance.
- Citrinet, ContextNet, and FastConformer: These speech models, developed by NVIDIA, offer diverse options based on network size and application needs. Each is optimized for streaming and robust real-time transcription.
- VOSK and DeepSpeech: Implementations that focus on robust, language-agnostic features for quick adaptation to Ukrainian speech, released under the Apache License 2.0.
The repository includes demo links for developers to test these models and discover which implementation best fits their needs.
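Models such as wav2vec2 emit frame-level CTC predictions rather than finished text, so a decoding step collapses repeated tokens and removes blanks. Below is a minimal sketch of greedy CTC decoding; the vocabulary and frame sequence are illustrative, not taken from any real checkpoint.

```python
# Greedy CTC decoding as used by wav2vec2-style acoustic models:
# collapse consecutive repeats, then drop the blank token.
BLANK = 0
VOCAB = {0: "", 1: "п", 2: "р", 3: "и", 4: "в", 5: "і", 6: "т"}  # hypothetical

def ctc_greedy_decode(frame_ids):
    """Collapse repeated ids, remove blanks, map ids to characters."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != BLANK:
            out.append(VOCAB[i])
        prev = i
    return "".join(out)

# Frames for "привіт" with repeats and blanks interleaved:
frames = [1, 1, 0, 2, 2, 3, 0, 0, 4, 5, 5, 0, 6, 6]
print(ctc_greedy_decode(frames))  # привіт
```

Note that a blank between two identical ids separates them, so genuine doubled letters survive decoding.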
Benchmarks
Benchmarks report metrics such as Word Error Rate (WER) and Character Error Rate (CER), helping developers understand each model's capabilities and limitations. Test sets such as Common Voice 10 are used to evaluate these metrics.
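Both metrics are Levenshtein edit distances normalized by reference length: WER operates on words, CER on characters. A self-contained sketch (any production benchmark would use an established library instead):

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[len(hyp)]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("привіт світе", "привіт світ"))  # 0.5 (one substituted word)
```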
Development Resources
For the curious developer or aspiring data scientist, there are instructions on:
- Creating models using Kaldi, a popular toolkit for speech recognition.
- Crafting a KenLM language model from Ukrainian Wikipedia data.
- Exporting a JIT version of wav2vec2 models for streamlined deployment.
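Before a corpus such as Ukrainian Wikipedia can be fed to KenLM's `lmplz`, the text is typically normalized into one clean, lowercase sentence per line. The sketch below is a hypothetical preprocessing step, not the repository's actual pipeline; the character classes are an assumption based on the Ukrainian alphabet.

```python
import re

APOSTROPHE_VARIANTS = "ʼ’"  # unify typographic apostrophes to ASCII '

def normalize_line(line):
    """Lowercase, keep only Ukrainian letters and apostrophes, collapse spaces."""
    line = line.lower()
    for ch in APOSTROPHE_VARIANTS:
        line = line.replace(ch, "'")
    line = re.sub(r"[^а-щьюяєіїґ' ]+", " ", line)
    return re.sub(r"\s+", " ", line).strip()

print(normalize_line("Привіт, Світе! 123"))  # "привіт світе"
```

The normalized corpus would then be passed to KenLM, e.g. `lmplz -o 3 < corpus.txt > model.arpa` for a 3-gram model.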
Datasets
This project aggregates a wealth of audio data from a variety of sources essential for training and improving Ukrainian STT models:
- Compiled Datasets: Over 188GB from open sources and communities, available via Nextcloud and academic torrents.
- Voice of America and FLEURS datasets: Specialized collections for distinct applications.
- Community-generated data, including Ukrainian podcasts and transcripts, which provides the local context essential for model accuracy.
Related Projects
The repository showcases related advancements such as:
- Language Models: Tailored for Ukrainian language applications on Huggingface.
- Inverse Text Normalization for transcribing spoken numbers and dates accurately.
- Text Enhancement tools that focus on improving written text comprehension by adding punctuation and capitalization.
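To make the inverse text normalization (ITN) task concrete: an ITN step rewrites spoken-form numerals produced by an STT model into written form. Real systems use WFST grammars; the dictionary below is a deliberately tiny, hypothetical fragment for illustration only.

```python
# Toy ITN: rewrite spoken Ukrainian numerals as digits.
UNITS = {"один": 1, "два": 2, "три": 3, "чотири": 4, "п'ять": 5}
TENS = {"двадцять": 20, "тридцять": 30, "сорок": 40}

def itn(text):
    out, total, in_number = [], 0, False
    for word in text.split() + [""]:  # sentinel flushes a trailing number
        if word in TENS or word in UNITS:
            total += TENS.get(word, 0) + UNITS.get(word, 0)
            in_number = True
        else:
            if in_number:
                out.append(str(total))
                total, in_number = 0, False
            if word:
                out.append(word)
    return " ".join(out)

print(itn("мені двадцять три роки"))  # "мені 23 роки" ("I am 23 years old")
```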
Text-to-Speech
Although less detailed in the repository, the TTS section provides examples of text annotated with word stresses, which aids in synthesizing more natural-sounding speech.
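Stress in Ukrainian text is commonly marked with the combining acute accent (U+0301) placed after the stressed vowel. A minimal sketch of adding and stripping such marks; the helper names and the fixed-index interface are assumptions for illustration.

```python
STRESS = "\u0301"  # combining acute accent, follows the stressed vowel

def add_stress(word, vowel_index):
    """Insert a stress mark after the character at vowel_index."""
    return word[:vowel_index + 1] + STRESS + word[vowel_index + 1:]

def strip_stress(word):
    """Remove all stress marks, recovering the plain spelling."""
    return word.replace(STRESS, "")

stressed = add_stress("привіт", 4)  # stress the "і"
print(stressed)                # приві́т
print(strip_stress(stressed))  # привіт
```

Because U+0301 is a combining character, stressed and unstressed spellings differ in length, so TTS pipelines usually strip or normalize the marks consistently before tokenization.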
In conclusion, this repository is a dynamic, continually updated resource that invites developers and researchers working on Ukrainian language technology to explore, contribute, and build on its materials.