# Speech Recognition
whisper
Whisper, OpenAI's open-source speech recognition system, uses a Transformer sequence-to-sequence model for multilingual transcription, translation into English, and language identification. Pre-trained models ranging from 'tiny' to 'turbo' trade accuracy against speed, and the toolkit can be driven from Python or the command line, making it a solid default for developers who need robust multilingual transcription.
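A minimal transcription sketch with the openai-whisper Python package; the model size and audio path are placeholders to adjust:

```python
# pip install -U openai-whisper  (FFmpeg must be on PATH)
import whisper

# "turbo" can be swapped for "tiny", "base", "small", "medium", or "large"
model = whisper.load_model("turbo")

# transcribe() detects the language automatically unless one is specified
result = model.transcribe("audio.mp3")
print(result["text"])
```

The same models are available from the command line, e.g. `whisper audio.mp3 --model turbo`.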
AudioGPT
AudioGPT is an open-source initiative providing tools for analyzing and creating speech, music, and other audio. The project supports tasks such as text-to-speech, style transfer, and speech recognition through models like FastSpeech and Whisper. For audio manipulation, it includes tasks like text-to-audio and image-to-audio using models such as Make-An-Audio, and it offers talking-head synthesis with GeneFace. As some features are still being refined, AudioGPT continues to broaden its functionality for varied audio projects.
awesome-whisper
Explore a curated list of tools and resources for Whisper, OpenAI's open-source speech recognition system. This organized catalog features official documentation, model variations, apps, CLI utilities, web platforms, articles, videos, and community links. Understand implementations for diverse uses, including iOS and macOS applications, web solutions, and third-party APIs, focusing on speed, speaker diarization, and accuracy advancements, all aimed at enhancing speech-to-text processes across platforms.
willow-inference-server
Willow Inference Server (WIS) enables efficient language processing for self-hosted ASR, TTS, and LLM tasks with CUDA optimization, supporting affordable GPUs like the GTX 1060 and Tesla P4. It facilitates simultaneous model loading with low VRAM demand. Real-time speech recognition, custom TTS voices, and LLaMA-based functions enhance its utility, providing high performance even on lesser GPUs. WIS supports REST, WebRTC, and Web Sockets for broad integration across applications.
make-a-smart-speaker
Discover a wealth of open-source resources for assembling a smart speaker from scratch. Explore essential technologies including audio processing and natural language algorithms, and examine leading projects like Mycroft and SEPIA. Delve into SDKs like Amazon's Alexa and Google Assistant to enhance functionality, and leverage advanced libraries for a personalized, privacy-oriented smart speaker using Raspberry Pi.
june
The june project offers a local voice chatbot utilizing Ollama, Hugging Face Transformers, and Coqui TTS Toolkit. This solution ensures privacy by processing interactions locally without sending data externally. Features include various interaction modes, comprehensive installation documentation, and a FAQ section, making it suitable for secure and private local voice-assisted applications.
ASRT_SpeechRecognition
The ASRT_SpeechRecognition project implements a robust Chinese speech recognition system using TensorFlow and advanced neural network architectures such as DCNN and RNN with attention mechanisms. Compatible with Linux and Windows, the system offers versatile models for effective speech recognition. It includes comprehensive instructions on training, testing, and deploying the API server using both HTTP and GRPC protocols. Detailed documentation and client SDKs facilitate smooth integration and deployment for developers and researchers.
sherpa-onnx
The repository provides local speech processing features such as speech-to-text and text-to-speech, as well as speaker identification. It supports x86, ARM, and RISC-V architectures on operating systems like Linux, macOS, Windows, Android, and iOS. API support is available in C++, Python, JavaScript, and more, with WebAssembly compatibility. This tool covers voice activity detection, keyword spotting, and audio tagging, making it suitable for various applications across different systems and hardware. Explore its capabilities on Huggingface spaces without installation.
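A sketch of offline (non-streaming) decoding with the Python API, assuming a downloaded pre-trained transducer model; the encoder/decoder/joiner and token-table file names below are placeholders:

```python
# pip install sherpa-onnx soundfile
import sherpa_onnx
import soundfile as sf

# paths are placeholders for an exported transducer model
recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    tokens="tokens.txt",
)

samples, sample_rate = sf.read("test.wav", dtype="float32")  # assumes 16 kHz mono
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```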
TensorflowASR
Discover a Conformer-based speech recognition model designed for TensorFlow 2, with a real-time factor of roughly 0.1 on CPU. Built on CTC combined with a translation-style decoder, it supports online and offline recognition, VAD, noise reduction, and TTS-based data augmentation. Competitive results on the AISHELL-1 dataset highlight its capabilities in both offline and streaming contexts, and it provides Conformer, BlockConformer, and ChunkConformer variants for efficient ASR solutions.
stt
An offline speech recognition tool that converts audio or video into text using the faster-whisper model. It outputs formats such as JSON and SRT, making it a viable local alternative to the OpenAI and Baidu speech APIs. Users can choose model sizes to match hardware capabilities and leverage CUDA acceleration with NVIDIA GPUs. The tool is easy to deploy on Windows, Linux, and macOS, with a straightforward setup and detailed API documentation.
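Since the tool wraps the faster-whisper library, the underlying transcription call looks roughly like the sketch below; the model size and device settings are assumptions to adjust for your hardware:

```python
# pip install faster-whisper
from faster_whisper import WhisperModel

# use device="cpu", compute_type="int8" on machines without an NVIDIA GPU
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("video.mp4", beam_size=5)
print("Detected language:", info.language)
for seg in segments:
    # segment boundaries are what a JSON/SRT exporter would serialize
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```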
espnet
The toolkit facilitates end-to-end speech recognition and text-to-speech using PyTorch and Kaldi-style data processing. It manages numerous tasks like speech recognition, translation, enhancement, and diarization efficiently. By providing detailed recipes for ASR and TTS, and integrating with neural vocoders, it supports offline and streaming functionalities, making it a valuable resource for speech technology research and development.
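A rough inference sketch with the ESPnet2 Python API; the model tag is a placeholder for any ASR model published through espnet_model_zoo:

```python
# pip install espnet espnet_model_zoo soundfile
import soundfile
from espnet2.bin.asr_inference import Speech2Text

model_tag = "espnet/..."  # placeholder: substitute a real ASR model tag from the zoo
speech2text = Speech2Text.from_pretrained(model_tag)

speech, rate = soundfile.read("speech.wav")  # 16 kHz mono expected by most recipes
nbests = speech2text(speech)
text, *_ = nbests[0]  # best hypothesis
print(text)
```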
MITSUHA
The project offers a virtual assistant with advanced voice interaction and smart home capabilities, supporting multiple languages. It uses Python, OpenAI's Whisper, and HyperDB to facilitate effective communication and smart home control via microphone and Tuya Cloud IoT. The assistant is designed for English, Japanese, Korean, and Chinese, and includes features like long-term memory and plans for VR integration.
ChatGPT-OpenAI-Smart-Speaker
Utilizes OpenAI and Google Speech Recognition within a Raspberry Pi platform to create an interactive smart speaker. Offers speech-to-text functionality and customizable voice model integration with PicoVoice. Easily deployable on both PC/Mac and Raspberry Pi, it supports various languages and operational settings, making it versatile for tech enthusiasts.
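The listening half of such a speaker typically builds on the SpeechRecognition package; a minimal capture-and-recognize sketch (not the project's exact code) looks like this:

```python
# pip install SpeechRecognition pyaudio
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
    print("Listening...")
    audio = recognizer.listen(source)

try:
    # Google Web Speech API backend
    print(recognizer.recognize_google(audio, language="en-US"))
except sr.UnknownValueError:
    print("Could not understand the audio")
```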
LangHelper
LangHelper provides a dynamic platform for AI-driven speech recognition and synthesis, featuring multi-accent communication and celebrity voice interaction. It includes advanced pronunciation assessment tools suitable for IELTS/TOEFL test preparation and supports integration with ChatGPT and espeak-ng for efficient setup on Windows. The project is engineered to offer interactive sessions through both custom and default text-to-speech functionalities, catering to accent training and pronunciation competence.
RuntimeSpeechRecognizer
Runtime Speech Recognizer provides efficient speech recognition built on OpenAI Whisper. It supports both English-only and multilingual models covering up to 100 languages, with selectable model sizes from 75 MB to 2.9 GB, automatic model download, and optional speech translation to English. It ships without static libraries or external dependencies, allowing cross-platform integration on Windows, Mac, Linux, Android, and iOS. Ideal for developers in need of reliable speech recognition across different applications.
stable-ts
The stable-ts library enhances the functionality of Whisper by providing reliable timestamps in audio transcription. Key integrations include voice isolation, noise reduction, and dynamic time warping to achieve precise word-level timestamps. It supports diverse model configurations and preprocessing methods, improving transcription accuracy by suppressing silence and refining outputs. The library requires FFmpeg and PyTorch for installation, offering cross-platform compatibility and customizable options via denoisers and voice detection methods. Additionally, it connects with any ASR system, enabling its application in various audio transcription scenarios.
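Basic usage mirrors Whisper but returns refined, word-level timestamps that can be exported directly; a short sketch (file names are placeholders):

```python
# pip install -U stable-ts  (requires FFmpeg and PyTorch)
import stable_whisper

model = stable_whisper.load_model("base")

# silence suppression and regrouping defaults already refine segment boundaries
result = model.transcribe("audio.mp3")

# export subtitles with word-level timing
result.to_srt_vtt("audio.srt", word_level=True)
```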
Whisper-Finetune
The project fine-tunes the Whisper speech recognition model with LoRA, supporting timestamped, non-timestamped, and audio-less training data, and accelerates inference with CTranslate2 and GGML for deployment on Windows, Android, and servers with both original and fine-tuned models. Whisper recognizes 98 languages and can translate speech into English; models from whisper-tiny to whisper-large-v3 are compared on error rates across datasets, and recent updates improve Chinese recognition and processing speed. The documentation walks through setup, data preparation, and evaluation.
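The general LoRA recipe on top of Hugging Face transformers + peft looks roughly like the sketch below; the base checkpoint and adapter hyperparameters are illustrative assumptions, not the repository's exact training script:

```python
# pip install transformers peft
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# adapt only the attention projections; r/alpha/dropout are illustrative values
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights stays trainable
```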
SenseVoice
SenseVoice is a speech foundation model providing capabilities in automatic speech recognition, speech emotion recognition, and audio event detection across 50+ languages. It delivers high precision in multilingual recognition, outperforming many leading models. The non-autoregressive framework offers significantly faster audio processing, up to 15 times quicker than comparable models. With flexible finetuning and versatile deployment options, the model meets varied business and technical requirements. Recent enhancements include ONNX and libtorch export features, improving integration and usability.
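Inference typically runs through FunASR's AutoModel interface; a hedged sketch, where the ModelScope model id and generation options are assumptions based on the project's published usage:

```python
# pip install funasr
from funasr import AutoModel

# assumed model id for the SenseVoice small checkpoint
model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)

res = model.generate(
    input="audio.wav",
    language="auto",   # auto-detect among the supported languages
    use_itn=True,      # apply inverse text normalization to the output
)
print(res[0]["text"])
```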
PaddleSpeech
PaddleSpeech provides a robust toolkit on the PaddlePaddle platform for speech recognition, translation, and text-to-speech synthesis. It features award-winning models recognized by NAACL2022, making it suitable for various audio processing applications across multiple languages. The toolkit offers regular updates with cutting-edge models and facilitates easy system integration. It caters to researchers and developers aiming for precise audio processing, featuring reliable text-to-speech synthesis, accurate speech recognition, and efficient speech translation.
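ASR and TTS are both exposed through single-call executors; a minimal sketch (audio file names are placeholders, and default models download on first use):

```python
# pip install paddlepaddle paddlespeech
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.tts.infer import TTSExecutor

asr = ASRExecutor()
text = asr(audio_file="input_16k.wav")  # 16 kHz mono input
print(text)

tts = TTSExecutor()
tts(text="你好，欢迎使用 PaddleSpeech。", output="output.wav")
```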
android-speech
The Android-Speech library streamlines the process of implementing speech recognition and text-to-speech features in Android applications. It offers simple Gradle setup, extensive examples, and customizable views for speech interactions. Developers benefit from adjustable voice, locale options, and logging settings, making the library versatile and adaptable. A demo app is available for easy adoption, ensuring efficient audio processing. With robust community support and detailed documentation, it's suited for applications aiming to improve interaction through natural language processing.
parrots
Parrots provides an efficient solution for Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) with multilingual support in Chinese, English, and Japanese. Utilizing models like 'distilwhisper' and 'GPT-SoVITS', this toolkit facilitates seamless voice recognition and synthesis. It supports straightforward installation, command-line operations, and integration with platforms like Hugging Face, ideal for applications necessitating advanced voice interaction.
gpt-voice-conversation-chatbot
GPT-Voice-Conversation-Chatbot uses OpenAI's ChatGPT and GPT-4 APIs to enable AI-driven voice conversations, with memory management and personalization features. The bot can be driven from the terminal or by voice, speaks through Google TTS or ElevenLabs voices, and requires an OpenAI API key. It runs on both Windows and Linux and exposes adjustable parameters such as creativity, voice modulation, and conversational presets, making it an adaptable, intuitive choice for language learning, coding assistance, and casual conversation.
audio
TorchAudio is a powerful library that leverages PyTorch's GPU capabilities for audio processing. It supports various audio formats and includes dataloaders for common datasets, making it integral for machine learning. Features include audio I/O, speech processing, and transforms like Spectrogram and MFCC, ensuring smooth PyTorch integration. Compliance interfaces enhance compatibility with other tools, offering a seamless experience for PyTorch users in audio and speech fields. Discover more about TorchAudio's features in the documentation.
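A short sketch of the I/O and transform APIs (the file name is a placeholder):

```python
# pip install torch torchaudio
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load("speech.wav")  # shape: (channels, frames)

# common front-end features for speech models
mel = T.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
mfcc = T.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)

print(mel.shape, mfcc.shape)
```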
3D-Speaker
Discover an open-source platform designed for single- and multi-modal speaker verification, recognition, and diarization. Benefit from pretrained models on ModelScope and utilize the large-scale 3D-Speaker speech corpus for research in speech representation. This toolkit includes multiple training and inference recipes for datasets such as 3D-Speaker, VoxCeleb, and CN-Celeb, featuring models like CAM++, ERes2Net, ERes2NetV2, and ECAPA-TDNN. Keep updated with regular releases and comprehensive documentation, making it a valuable resource for researchers and developers in speech technology.
chinese_speech_pretrain
This project uses extensive Chinese audio data from sources like YouTube and Podcasts to train models such as wav2vec 2.0 and HuBERT via Fairseq. These models, available in BASE and LARGE versions, enhance speech recognition and are evaluated on datasets like Aishell and WenetSpeech. Accessible on Hugging Face, these models are suitable for diverse applications, showing improved performance in varied noise and recording settings.
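Extracting frame-level representations from the released checkpoints is a standard transformers call; the model id below is an assumption based on the project's Hugging Face releases:

```python
# pip install transformers torch soundfile
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_id = "TencentGameMate/chinese-wav2vec2-base"  # assumed Hugging Face model id
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id).eval()

speech, sr = sf.read("mandarin_16k.wav")  # 16 kHz mono expected
inputs = feature_extractor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, hidden_size)
print(hidden.shape)
```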
pykaldi
PyKaldi acts as a bridge to smoothly integrate Kaldi's robust capabilities into Python. It offers comprehensive wrappers for C++ code from Kaldi and OpenFst, tailored for the speech recognition community. PyKaldi facilitates complex processes like manipulating Kaldi objects and using low-level functions without requiring extensive C++ knowledge. It includes high-level ASR, alignment, and segmentation modules to boost effectiveness for Python developers. Its NumPy integration ensures efficient data manipulation, backed by a modular design for easy maintenance and scalability. PyKaldi effectively extends Python's reach in ASR projects, enhancing synergy between Python and Kaldi.
lectures
Discover a cutting-edge course on Natural Language Processing focusing on neural networks' applications in speech and text analysis. Delve into essential topics like sequential language modeling and transduction tasks, complemented by hands-on projects on CPU and GPU hardware. This course is directed by Phil Blunsom with the DeepMind Natural Language Research Group, aiming to deepen understanding of neural networks in NLP.
FunClip
FunClip is an open-source tool designed for precise video clipping using Alibaba TONGYI's Paraformer and SeACo-Paraformer models. It supports local deployment for conducting video speech recognition, enabling the extraction and clipping of specific segments by text or speaker. Integrating AI capabilities through large language models (LLM), it offers hotword customization and intuitive use with Gradio and cross-platform accessibility. With enhanced features for English audio and speaker diarization, it provides SRT subtitle generation for entire and clipped videos. Accessible via Modelscope and HuggingFace, it serves both experienced users and newcomers to video editing.
LiveWhisper
LiveWhisper leverages OpenAI's Whisper model to perform continuous audio transcription by capturing microphone input and processing it during silent intervals. It facilitates voice commands for weather updates, trivia, and media control, activating with phrases like 'hey computer'. Serving as an alternative to SpeechRecognition, this system employs sounddevice, numpy, scipy, and libraries like requests and pyttsx3. Contributions via Ko-fi support ongoing development.
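The underlying pattern — capture microphone audio with sounddevice, then hand the float array to Whisper — can be sketched as a single fixed-length capture (not the project's silence-triggered loop):

```python
# pip install sounddevice numpy openai-whisper
import numpy as np
import sounddevice as sd
import whisper

SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono
seconds = 5

recording = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
sd.wait()  # block until the capture finishes

model = whisper.load_model("base")
result = model.transcribe(np.squeeze(recording))
print(result["text"])
```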
whisper-youtube
OpenAI's Whisper facilitates efficient and accurate transcription of YouTube videos with multilingual speech recognition, translation, and language identification. Compatible with various GPUs on Google Colab, it ensures high processing speeds and effective performance, even on less powerful GPUs. Users can modify inference settings and save transcripts and audio to Google Drive. Whisper's capability to handle diverse audio datasets makes it relevant for precise transcriptions.
Whisper-transcription_and_diarization-speaker-identification-
Discover the use of OpenAI's Whisper for precise audio transcription and speaker differentiation with Pyannote-audio. This guide offers comprehensive instructions on audio preparation and the integration of transcription with speaker segments. Benefit from Whisper's robust model trained on vast multilingual data for enhanced performance across diverse acoustic conditions.
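A hedged sketch of the overall flow: transcribe with Whisper, diarize with pyannote, then attach each transcript segment to the speaker active at its midpoint. The pipeline name and access token are assumptions, and the overlap logic is deliberately simplified:

```python
# pip install openai-whisper pyannote.audio
import whisper
from pyannote.audio import Pipeline

audio = "meeting.wav"

asr = whisper.load_model("medium").transcribe(audio)
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # assumed pretrained pipeline; needs an HF access token
    use_auth_token="HF_TOKEN",
)(audio)

# naive assignment: give each ASR segment the speaker whose turn covers its midpoint
for seg in asr["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    speaker = next(
        (spk for turn, _, spk in diarization.itertracks(yield_label=True)
         if turn.start <= mid <= turn.end),
        "UNKNOWN",
    )
    print(f"{speaker}: {seg['text'].strip()}")
```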
k2
k2 aims to integrate Finite State Automaton (FSA) and Finite State Transducer (FST) into autograd-based platforms such as PyTorch and TensorFlow. This is particularly advantageous for speech recognition, allowing diverse training objectives and joint system optimization. The focus on pruned FSA composition facilitates efficient ASR decoding and training, utilizing a codebase largely in C++ and CUDA to support parallel execution. Progressing towards production, k2 offers Python integration with pybind11 and has speech recognition recipes in related repositories.
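A tiny illustration of the FSA primitives; the transition strings are made up, and scores are log-domain weights:

```python
# build/install instructions are in the k2 documentation (requires PyTorch)
import k2

# acceptor format: "src_state dst_state label score"; label -1 marks entry to the final state
s = """
0 1 10 0.1
0 1 11 0.2
1 2 -1 0.3
2
"""
fsa = k2.Fsa.from_str(s)
fsa_vec = k2.create_fsa_vec([fsa])  # most k2 operations work on batches of FSAs

# best path in the tropical semiring, as used when extracting a 1-best ASR hypothesis
best = k2.shortest_path(fsa_vec, use_double_scores=True)
print(best)
```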
tensorflow-speech-recognition
Discover insights into speech recognition using the TensorFlow sequence-to-sequence framework. Despite its outdated status, this project serves educational purposes, focusing on creating standalone Linux speech recognition. While new projects like Whisper and Mozilla's DeepSpeech lead advancements, foundational techniques remain essential. Packed with modular extensions and educational examples, it offers a platform for learning and experimentation. Detailed installation guides specify key dependencies such as PyAudio.
wenet
WeNet provides a speech recognition toolkit that is ready for production, focusing on easy installation and efficient performance. Supporting both streaming and non-streaming capabilities, it demonstrates leading results on public speech datasets. With thorough documentation and an emphasis on ease of use, WeNet is ideal for developers incorporating accurate speech recognition into existing systems. It is compatible with Python 3.7/3.8 and CUDA, facilitating rapid deployment with options for pretrained models. Additional resources include installation and usage guides, as well as a supportive community. The toolkit is based on open-source projects like ESPnet and Kaldi.
sherpa
Sherpa is an open-source speech-to-text inference framework focusing on end-to-end models using PyTorch, suitable for deploying pre-trained transducer and CTC-based models. It supports speech transcription via both C++ and Python APIs. For training and fine-tuning models, refer to the Icefall project. Consider Sherpa-ONNX and Sherpa-NCNN for similar projects without PyTorch, which also support iOS, Android, and embedded systems. Comprehensive documentation and a browser-based demo are available for further exploration.
Feedback Email: [email protected]