RealtimeSTT - Real-time Speech-to-Text Library for Low-Latency Applications

Introducing RealtimeSTT

RealtimeSTT is an easy-to-use library that transcribes speech into text in real time, which suits applications that need fast, accurate speech recognition.

New Features

The project recently added the AudioToTextRecorderClient class, which starts a server automatically if one is not already running and then connects to it. The class mirrors the interface of the existing AudioToTextRecorder, so switching between the two is simple. It is still a work in progress: some parameters and callbacks are not yet fully implemented.

The command-line interface has also been revamped for simplicity, with "stt-server" to start the server and "stt" to start the client.

About the Project

RealtimeSTT is engineered to listen through the microphone and transform the spoken word into written text. This tool is beneficial for:

Voice Assistants that need to understand and respond to human speech.
Any application requiring swift and accurate speech-to-text conversion.

The project grew out of Linguflex, an advanced open-source assistant offering voice control.

Latest Updates

The current release is version 0.3.4. Because the library uses the multiprocessing module, wrap your entry-point code in an if __name__ == '__main__': guard to prevent unexpected behavior on certain platforms, such as Windows.
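To illustrate why the guard matters, here is a minimal sketch that uses only the standard library. It is not RealtimeSTT code, and the function names (square, worker, main) are made up for this example. On platforms that start child processes by re-importing the main module, any code outside the guard would run again in every child.

```python
import multiprocessing as mp


def square(n):
    """Trivial piece of work done in the child process."""
    return n * n


def worker(n, results):
    """Runs in the child process and reports its result through a queue."""
    results.put(square(n))


def main():
    results = mp.Queue()
    child = mp.Process(target=worker, args=(3, results))
    child.start()
    value = results.get()  # Read the result before joining the child.
    child.join()
    print(value)


if __name__ == '__main__':
    # Only the original process gets past this guard. A child that
    # re-imports this module skips main() and does not start another child.
    main()
```

The same pattern applies to any script that creates an AudioToTextRecorder: keep the construction and the main loop under the guard.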

Quick Usage Examples

To print everything said:

from RealtimeSTT import AudioToTextRecorder

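def process_text(text):
    # Called with each finished transcription.
    print(text)

if __name__ == '__main__':
    # Guard required on platforms such as Windows (see Latest Updates).
    recorder = AudioToTextRecorder()
    while True:
        # Wait for the next utterance and pass its transcription to process_text.
        recorder.text(process_text)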

To type everything that is said:

from RealtimeSTT import AudioToTextRecorder
import pyautogui

def process_text(text):
    pyautogui.typewrite(text + " ")

if __name__ == '__main__':
    recorder = AudioToTextRecorder()
    while True:
        recorder.text(process_text)

Both examples transcribe spoken words into text. The second goes a step further and types the words into whichever text box currently has focus.

Key Features

Voice Activity Detection: Automatically starts and stops listening as you speak.
Real-time Transcription: Provides immediate conversion of speech to text.
Wake Word Activation: Activates when a designated wake word is detected.

Technology Behind RealtimeSTT

The library utilizes cutting-edge technologies such as:

Voice Activity Detection with WebRTC VAD and Silero VAD to detect when speech begins and ends.
Speech Recognition using Faster_Whisper for rapid transcription of spoken content.
Wake Word Detection with Porcupine or OpenWakeWord for activating on specific command words.

These technologies provide a robust foundation for real-time audio applications.
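To make the idea of voice activity detection concrete, here is a deliberately simplified sketch. It is not how RealtimeSTT works internally: the library relies on WebRTC and Silero VAD, while this example swaps in a plain energy threshold. The function names, frame size, and threshold are all assumptions chosen for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech_segments(samples, frame_size=160, threshold=0.1):
    """Return (start_frame, end_frame) pairs of frames louder than threshold.

    end_frame is exclusive. Trailing samples that do not fill a whole
    frame are ignored. frame_size and threshold are illustrative values.
    """
    segments = []
    start = None
    n_frames = len(samples) // frame_size
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        active = frame_rms(frame) > threshold
        if active and start is None:
            start = i                    # speech begins
        elif not active and start is not None:
            segments.append((start, i))  # speech ends
            start = None
    if start is not None:
        segments.append((start, n_frames))  # still speaking at end of input
    return segments
```

A real detector has to cope with background noise and short pauses, which is why model-based approaches such as Silero VAD are used in practice.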

Installation & Setup

To get started, simply run:

pip install RealtimeSTT

If you have a CUDA-capable NVIDIA GPU, enabling GPU support is recommended for best performance.

Install the PyTorch build that matches your CUDA version. For example, for CUDA 11.8:

pip install torch==2.3.1+cu118 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

Before starting, additional setup may be necessary, such as installing the NVIDIA CUDA Toolkit and NVIDIA cuDNN.
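After installing, it can be worth checking whether PyTorch actually sees the GPU. The helper below is a small sketch. The name cuda_available is made up for this example, and it reports False when PyTorch is not installed at all.

```python
def cuda_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so GPU acceleration is not available.
        return False
    return torch.cuda.is_available()


if __name__ == '__main__':
    print("CUDA available:", cuda_available())
```

If this prints False on a machine with an NVIDIA GPU, the usual suspects are a CPU-only PyTorch build or a missing CUDA Toolkit or cuDNN installation.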

Recommended Usage Practices

RealtimeSTT supports both manual and automatic recording modes, accommodating various use cases from controlled environments to dynamic real-time applications.
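The difference between the two modes can be sketched as follows. The helper names record_manually and record_automatically are made up for this example. The start(), stop(), and text() calls follow the recorder interface as documented in the project's README, so check them against the version you have installed. The helpers take the recorder as an argument, so nothing is recorded until you call them.

```python
def record_manually(recorder, wait_for_stop):
    """Manual mode: the caller decides when recording starts and stops.

    wait_for_stop is any callable that blocks until recording should end.
    """
    recorder.start()
    wait_for_stop()
    recorder.stop()
    return recorder.text()


def record_automatically(recorder):
    """Automatic mode: voice activity detection decides when to record.

    text() waits for speech, records until it ends, and returns the result.
    """
    return recorder.text()
```

A typical manual call would be record_manually(AudioToTextRecorder(), lambda: input("Press Enter to stop recording...")), placed inside an if __name__ == '__main__': block.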

Wake words and callbacks let an application react to speech events as they happen: a wake word controls when recording begins, and callbacks run your own code when events such as the start or end of a recording occur.
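A minimal sketch of combining a wake word with callbacks is shown below. The helper names build_recorder_options and listen_after_wake_word are made up for this example, and "jarvis" is only a sample wake word. The keyword arguments wake_words, on_recording_start, and on_recording_stop follow the project's README, so treat them as assumptions to verify against your installed version.

```python
def on_recording_start():
    """Runs when the recorder begins capturing audio."""
    print("Recording started")


def on_recording_stop():
    """Runs when the recorder stops capturing audio."""
    print("Recording stopped")


def build_recorder_options(wake_word="jarvis"):
    """Collect the recorder's keyword arguments in one place."""
    return {
        "wake_words": wake_word,
        "on_recording_start": on_recording_start,
        "on_recording_stop": on_recording_stop,
    }


def listen_after_wake_word(recorder_factory, wake_word="jarvis"):
    """Create a recorder and return the first utterance after the wake word.

    recorder_factory is expected to be the AudioToTextRecorder class.
    """
    recorder = recorder_factory(**build_recorder_options(wake_word))
    return recorder.text()
```

In a real script this would be called as listen_after_wake_word(AudioToTextRecorder) inside an if __name__ == '__main__': block.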

Overall, RealtimeSTT is a powerful library that accommodates both simple and sophisticated real-time transcription needs, designed to integrate into your projects with ease.