# Speech Recognition
vosk-api
Vosk is an open source speech recognition toolkit offering offline capabilities in over 20 languages. It is suitable for applications like chatbots, smart devices, and transcription services. The toolkit features compact models for efficient, zero-latency performance and supports multiple programming languages and platforms, ranging from Raspberry Pi to large clusters, making it versatile for various speech-driven tasks.
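As a rough illustration of the Python bindings, the sketch below transcribes a mono 16 kHz WAV file offline; the model directory and file names are placeholders for a locally downloaded Vosk model.

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Placeholder paths: a downloaded Vosk model unpacked to ./model,
# and a 16 kHz, 16-bit mono PCM recording.
model = Model("model")
wf = wave.open("test.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if not data:
        break
    rec.AcceptWaveform(data)  # feed audio in chunks; partial results are also available

print(json.loads(rec.FinalResult())["text"])
```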
JARVIS
This project delivers a voice-activated assistant that integrates Deepgram's speech-to-text and ElevenLabs' text-to-speech with OpenAI's GPT-3 for dynamic responses. Accessible via a web interface, it appeals to AI enthusiasts and developers. Setup requires Python 3.8-3.11 and API keys; installation is straightforward, making voice interaction easy to get running.
whisper.cpp
The whisper.cpp project delivers efficient ASR model inference through a plain C/C++ implementation, with broad compatibility across Apple Silicon, x86, and other architectures. It features mixed-precision and quantized inference, supports macOS, Windows, iOS, and WebAssembly, and runs on both CPU and GPU. Noteworthy for its optimization on Apple Silicon, it enables offline use on mobile devices and in-browser operation, making ASR application development efficient and portable.
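A minimal way to drive it from a script is to call the example CLI; the sketch below assumes the project has been built locally and a ggml model downloaded, so the binary and file paths are placeholders.

```python
import subprocess

# Placeholder paths: a locally built whisper.cpp example binary and a
# downloaded ggml model; recent releases name the binary whisper-cli.
result = subprocess.run(
    ["./main", "-m", "models/ggml-base.en.bin", "-f", "samples/jfk.wav"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # the CLI prints timestamped transcription segments
```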
pyvideotrans
Provides a solution for translating and dubbing videos across languages, with automated subtitle and voiceover generation. Supports a wide range of speech recognition and text-to-speech engines, including OpenAI and Google. Facilitates batch tasks such as audio-visual conversion, subtitle translation, and video merging. Works with numerous languages and operating systems, including Windows, macOS, and Linux, with pre-packaged Windows builds available. Ideal for developers looking to integrate with video translation APIs.
dla
Discover the forefront of audio deep learning in a methodically designed course featuring weekly lectures, seminars, and self-study opportunities. This autumn 2024 offering at the HSE CS Faculty delves into essential areas such as digital signal processing, speech recognition, and audio-visual fusion. The course provides detailed exploration of advanced models like CTC, RNN-T, and self-supervised learning. Engage in practical projects such as training speech recognition models and audio-visual separation to solidify your understanding of audio technologies.
ReazonSpeech
ReazonSpeech offers versatile speech recognition tools based on FastConformer-RNNT and Conformer-Transducer models, ensuring speed and accuracy across multiple applications. The repository supports popular platforms such as Nvidia NeMo, sherpa-onnx, and ESPnet, catering to needs ranging from building robust applications to analyzing Japanese TV streams. Key packages like reazonspeech.nemo.asr, reazonspeech.k2.asr, and reazonspeech.espnet.asr are available for seamless integration and development.
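As a hedged sketch of the NeMo-backed package's documented usage pattern (function names may differ between releases):

```python
# Assumed API shape for the reazonspeech.nemo.asr package; verify against
# the repository's README for the installed release.
from reazonspeech.nemo.asr import audio_from_path, load_model, transcribe

model = load_model()                   # fetches the pretrained model on first use
audio = audio_from_path("speech.wav")  # load a local recording
result = transcribe(model, audio)
print(result.text)
```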
NeMo
NVIDIA's NeMo Framework facilitates the creation of large language models and speech recognition systems with modular design and ease of use. It offers updated support for Llama 3.1 LLMs and enhances AI training with compatibility for Amazon EKS and Google Kubernetes Engine, while delivering significant improvements in ASR model inference speeds and multilingual capabilities through the Canary model.
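A short sketch of loading a pretrained checkpoint from the ASR collection and transcribing a file; the checkpoint name is only an example, and the exact output format depends on the model class.

```python
import nemo.collections.asr as nemo_asr

# Example pretrained name; any ASR checkpoint from the NeMo catalog or
# Hugging Face can be substituted.
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")

hypotheses = asr_model.transcribe(["audio.wav"])  # one transcription per input file
print(hypotheses[0])
```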
ai-audio-datasets
The repository provides a wide array of audio datasets, such as speech, music, and sound effects, that are crucial for training AI models and advancing AI-generated content. Supporting speech recognition, TTS systems, and research in emotion and language translation, these datasets accommodate various languages and demographics, suitable for both academic and commercial applications. Discover resources from multi-speaker corpora to diverse multilingual speech translation datasets, designed to promote innovation in audio AI technologies.
conformer
Discover how the Conformer model seamlessly integrates convolutional neural networks with transformers to enhance speech recognition. This method efficiently captures both local and global audio dependencies, offering improved accuracy over existing models. Built on PyTorch, it supports state-of-the-art performance and can be easily trained via OpenSpeech in Python environments. Highlights include straightforward installation, detailed usage guidance, and open-source contribution opportunities, adhering to PEP-8 standards.
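The block below approximates the repository's README example of running the PyTorch module on random features; the constructor arguments are illustrative and should be checked against the installed package.

```python
import torch
from conformer import Conformer

batch_size, seq_len, feat_dim, num_classes = 3, 1000, 80, 10

inputs = torch.rand(batch_size, seq_len, feat_dim)   # e.g. log-mel features
input_lengths = torch.LongTensor([1000, 900, 800])

# Illustrative hyperparameters; see the repository README for the full list.
model = Conformer(
    num_classes=num_classes,
    input_dim=feat_dim,
    encoder_dim=32,
    num_encoder_layers=3,
)

outputs, output_lengths = model(inputs, input_lengths)
print(outputs.shape)  # (batch, reduced time steps, num_classes)
```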
CapsWriter-Offline
CapsWriter-Offline provides reliable voice input and transcription on PCs, triggered by the Caps Lock key, with support for mixed Chinese and English input plus extras such as dynamic hotword adaptation and diary logging. It is compatible with Windows and Linux and handles media transcription through bundled FFmpeg.
audio-development-tools
Discover a wide array of open-source tools for audio and music development, including machine learning, audio generation, signal processing, and sound synthesis. This collection also provides resources in game audio, digital audio workstations, and speech technologies such as recognition and synthesis. Explore applications in spatial audio, music information retrieval, and singing voice synthesis, with a focus on deep learning applications to boost productivity and creativity in digital audio creation.
inference
The Xorbits Inference library streamlines deployment and management for advanced language, speech recognition, and multimodal models. It offers seamless hardware integration and flexible resource utilization, supporting state-of-the-art open-source models. Designed for developers and data scientists, Xorbits Inference facilitates deployment with a single command and supports distributed deployment alongside various interfaces like RESTful API, CLI, and WebUI. Stay updated with recent framework enhancements and models through extensive documentation and community support.
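A rough sketch of launching a speech model through the Python client against a locally running server; the endpoint, model name, and the audio-model method are assumptions to verify against the current documentation.

```python
from xinference.client import Client

# Assumes an `xinference-local` server on the default port; the model name,
# model_type, and transcriptions() call are assumptions from the docs.
client = Client("http://localhost:9997")
uid = client.launch_model(model_name="whisper-large-v3", model_type="audio")
model = client.get_model(uid)

with open("audio.wav", "rb") as f:
    print(model.transcriptions(f.read()))
```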
awesome-speech-recognition-speech-synthesis-papers
This repository provides a curated collection of key research papers in speech recognition and synthesis, covering areas like Text-to-Audio, Automatic Speech Recognition (ASR), Speaker Verification, Voice Conversion (VC), and Speech Synthesis (TTS). It also delves into specialized topics including Language Modelling, Confidence Estimates, and Music Modelling. The compilation features foundational works and recent advancements, offering valuable insights for researchers and practitioners in the field of audio processing. This serves as an extensive knowledge base for understanding the evolution of techniques and applications influencing today's speech and audio processing developments.
Maix-Speech
Maix-Speech is an optimized AI speech library that runs on both embedded devices and PCs, offering Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) for Chinese. It enables voice interaction across platforms such as x86_64 and the R329. Licensed under Apache 2.0, it gives developers the resources needed to add speech functions to their own systems, with detailed instructions available in its GitHub repository.
kaldi
Kaldi provides a robust toolkit for speech recognition across UNIX, Windows, mobile, and web platforms. It includes setup instructions for UNIX systems and platform-specific guidance for PowerPC, Android, and WebAssembly. The toolkit adheres to the Google C++ Style Guide, facilitating contributions through a detailed development workflow and community forums. Technical documentation and C++ coding tutorials on the project's website support integration in diverse environments.
ASR_Theory
Explore recent advancements in speech recognition, combining theory and practice through projects built with tools like Kaldi. Covers significant papers, updates on future studies, and key presentations, including Google's 2018 INTERSPEECH talk. Deep learning acoustic models based on syllable, word, and phonetic units are available on GitHub. For ongoing updates, follow the Meta-Speech website and related communities.
wav2letter
Explore the next step in speech recognition development with wav2letter's integration into Flashlight ASR. Access pivotal pre-consolidation resources and use detailed recipes for reproducing significant research models like ConvNets and sequence-to-sequence architectures. Utilize data preparation tools, confirm reproducibility with Flashlight 0.3.2, and connect with the dynamic wav2letter community. This MIT-licensed project offers innovative solutions for both supervised and semi-supervised speech recognition.
whisper_android
This guide details how to incorporate Whisper and the Recorder class into Android apps for effective offline speech recognition. It includes setup methods using TensorFlow Lite, practical code examples for Whisper initialization, and audio recording integration for efficient speech-to-text functionality. The tutorial covers key aspects such as setting file paths, managing permissions, and ensuring accurate transcription, thus enhancing Android app capabilities with reliable offline speech recognition.
GigaSpeech
GigaSpeech is a significant ASR corpus consisting of 10,000 hours of transcribed audio designed for a broad range of speech recognition applications. The dataset continually evolves to support numerous speech recognition toolkits like Kaldi and ESPnet, ensuring easy data preparation. Featuring contributions from major institutions, it offers rich audio sources including audiobooks, podcasts, and YouTube content suitable for both supervised and semi-supervised learning. With detailed metadata and resampling guidelines, it aims to extend ASR features, supporting future tasks such as speaker identification and language diversification. A valuable resource for researchers and developers in need of a comprehensive audio dataset.
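For a quick look at the data, one hedged option is streaming a small subset through the Hugging Face `datasets` hub mirror; the dataset id and subset name are assumptions, and access requires accepting the corpus terms and authenticating.

```python
from datasets import load_dataset

# Assumed hub id and subset name; the corpus is gated, so a Hugging Face
# token with accepted terms is required.
gs = load_dataset("speechcolab/gigaspeech", "xs", split="train", streaming=True)

sample = next(iter(gs))
print(sample["text"])
print(sample["audio"]["sampling_rate"])
```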
FunASR
FunASR bridges academic research and industry, supporting tasks such as speech recognition (ASR), voice activity detection (VAD), punctuation restoration, and speaker verification, along with fine-tuning and inference of high-performance pre-trained models. The Model Zoo, featuring models such as Paraformer and Whisper-large, is accessible via ModelScope and Hugging Face, catering to multilingual requirements. Easy-to-use scripts and tutorials simplify the deployment of speech recognition services, helping developers craft tailored solutions.
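A minimal sketch of the AutoModel interface, with example model names drawn from the Model Zoo (exact names may vary by release):

```python
from funasr import AutoModel

# Example Model Zoo names: Paraformer for ASR plus optional VAD and
# punctuation models. Names may differ between releases.
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")

result = model.generate(input="audio.wav")
print(result[0]["text"])
```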
speech_dataset
Explore a diverse collection of speech datasets in multiple languages, including Chinese, English, and Japanese, designed for speech recognition, synthesis, and speaker diarization. This collection supports various applications, such as speech commands and ASR system evaluation, facilitating advancements in speech technology. Notable datasets like Common Voice and LibriSpeech play a crucial role in enhancing machine learning models. This resource is invaluable for researchers seeking comprehensive audio data for developing speech-related solutions across different linguistic contexts.
Feedback Email: [email protected]