whisper-playground - Build Real-Time Multilingual Speech-to-Text Applications with faster-whisper, Diart, and pyannote.audio

Introduction to Whisper Playground

Whisper Playground is a project for building real-time speech-to-text web applications in 99 languages. It combines faster-whisper for transcription with Diart and pyannote.audio for speaker diarization. An online demo is available for trying it out before installing anything.

Getting Started

To set up Whisper Playground, follow these steps:

Install Prerequisites: Make sure Conda (a package and environment manager) and Yarn (a JavaScript package manager) are installed on your machine.
Repository Setup: Clone or fork the Whisper Playground repository from GitHub.
Environment Installation: Run the provided script with sh install_playground.sh to set up both the backend and frontend environments.
Configuration: Review config.py and adjust the transcription settings to suit your device. Then verify that config.js matches your backend configuration and server address.
Run the Backend: Start the backend server with cd backend && python server.py.
Launch the Frontend: In a separate terminal, change to the interface directory and run yarn start to open the React frontend.
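
Before running the install script, it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch using only the Python standard library. It is not part of the repository: the missing_tools helper and the REQUIRED_TOOLS tuple are hypothetical names introduced here for illustration.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Command-line tools the setup steps rely on: "conda" and "yarn" are the
# stated prerequisites, and "sh" is used to run install_playground.sh.
REQUIRED_TOOLS = ("conda", "yarn", "sh")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found; next run: sh install_playground.sh")
```

The which argument is injectable only so the check can be exercised without touching the real PATH.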

Pyannote Model Access

Whisper Playground uses pyannote.audio models hosted on the Hugging Face Hub. To use them, you need a Hugging Face account and must accept each model's terms of use:

Accept the terms for pyannote/segmentation, pyannote/embedding, and pyannote/speaker-diarization models.
Install the Hugging Face CLI tool and log in using your user access token, which is found under the Settings -> Access Tokens section in your Hugging Face account.
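
As an alternative to the interactive CLI login, the same token can be supplied from Python through the huggingface_hub package. The sketch below assumes the token has been exported in an environment variable named HF_TOKEN. The resolve_token and login_to_hub helpers are hypothetical and are not part of Whisper Playground.

```python
import os
from typing import Mapping, Optional


def resolve_token(
    env: Mapping[str, str] = os.environ, var: str = "HF_TOKEN"
) -> Optional[str]:
    """Return the stripped token stored in `var`, or None if unset or blank."""
    token = env.get(var, "").strip()
    return token or None


def login_to_hub() -> None:
    """Log in to the Hugging Face Hub with the token from the environment."""
    token = resolve_token()
    if token is None:
        raise RuntimeError("Set HF_TOKEN to your Hugging Face user access token.")
    # Imported lazily so resolve_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
```

Reading the token from the environment keeps it out of source files and shell history.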

Key Parameters

Whisper Playground offers customization through various parameters:

Model Size: Select a Whisper model size, from tiny to large-v2. Larger models are generally more accurate but slower and need more memory.
Language: Choose the language for transcription.
Transcription Timeout: Set how long the application waits before transcribing the buffered audio.
Beam Size: Set the number of candidate transcriptions kept during beam search decoding. Larger values can improve accuracy at the cost of speed.
Transcription Method: Select either 'real-time' for immediate transcription or 'sequential' for transcription with contextual pauses.
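
The parameters above can be pictured as one validated settings object. This is an illustrative sketch only: the TranscriptionSettings class, its field names, and its defaults are assumptions made here and do not reflect the project's actual config.py. The list of model sizes is likewise an assumption, based on the multilingual sizes faster-whisper offers up to large-v2, and the timeout is assumed to be in seconds.

```python
from dataclasses import dataclass

# Assumed list: multilingual model sizes up to large-v2.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v1", "large-v2")
METHODS = ("real-time", "sequential")


@dataclass(frozen=True)
class TranscriptionSettings:
    """Hypothetical container for the parameters described above."""

    model_size: str = "small"
    language: str = "en"                # language code
    transcription_timeout: float = 5.0  # assumed unit: seconds
    beam_size: int = 1
    method: str = "real-time"

    def __post_init__(self) -> None:
        # Reject values outside the ranges described in the text.
        if self.model_size not in MODEL_SIZES:
            raise ValueError(f"unknown model size: {self.model_size!r}")
        if self.method not in METHODS:
            raise ValueError(f"unknown transcription method: {self.method!r}")
        if self.beam_size < 1:
            raise ValueError("beam_size must be at least 1")
        if self.transcription_timeout <= 0:
            raise ValueError("transcription_timeout must be positive")

    def transcribe_kwargs(self) -> dict:
        """Keyword arguments to pass on to a transcribe call."""
        return {"language": self.language, "beam_size": self.beam_size}


settings = TranscriptionSettings(model_size="tiny", language="de", beam_size=5)
print(settings.transcribe_kwargs())
```

With faster-whisper, language and beam_size are keyword arguments of WhisperModel.transcribe, so the dictionary returned by transcribe_kwargs could be unpacked into that call.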

Troubleshooting

On macOS, building the wheel for safetensors may fail. Installing Rust with brew install rust may resolve this.

Known Bugs

Users might encounter the following known issues:

In sequential mode, speaker labels may swap unexpectedly.
In real-time mode, audio that does not reach the transcription timeout may not be transcribed.
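
The real-time limitation is easier to see with a toy model of timeout-based buffering. The sketch below is not the project's implementation. It only assumes that audio is accumulated until the timeout's worth has arrived and is then transcribed as a batch, which would leave any shorter trailing audio waiting. The TimeoutBuffer class and its flush method are hypothetical.

```python
from typing import List


class TimeoutBuffer:
    """Collects chunk durations and releases them once the timeout is reached."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.pending: List[float] = []

    def add(self, chunk_seconds: float) -> List[float]:
        """Add a chunk; return the batch to transcribe, or [] if still waiting."""
        self.pending.append(chunk_seconds)
        if sum(self.pending) < self.timeout:
            return []
        batch, self.pending = self.pending, []
        return batch

    def flush(self) -> List[float]:
        """Return whatever is still pending, for example when the stream ends."""
        batch, self.pending = self.pending, []
        return batch


buf = TimeoutBuffer(timeout=3.0)
for seconds in (1.0, 1.0, 1.0, 1.0):
    batch = buf.add(seconds)
    if batch:
        print("transcribe batch of", sum(batch), "seconds")
print("left waiting:", buf.pending)
```

In this model the fourth chunk stays in pending because the stream ended before another full timeout elapsed. An explicit flush when the stream ends is one way such leftover audio could be handled.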

Not every language has been tested. If you run into a language-specific problem, please report it by opening an issue.

Licensing

The Whisper Playground code and the Whisper model weights are released under the MIT License.