Introduction to Whisper Playground
Whisper Playground is a platform for building real-time speech-to-text applications in 99 languages. It builds on faster-whisper, Diart, and Pyannote for transcription. An online demo is available for trying it out.
Getting Started
To begin using Whisper Playground, install a few tools and follow these steps:
- Install Prerequisites: Make sure you have Conda, a package management system, and Yarn, a package manager for JavaScript, installed on your device.
- Repository Setup: Clone or fork the Whisper Playground repository from GitHub.
- Environment Installation: Run the provided script sh install_playground.sh to set up both backend and frontend environments.
- Configuration: Check and adjust the config.py file to ensure the transcription settings match your device. Similarly, verify that config.js aligns with your backend configuration and address.
- Run the Backend: Start the backend server with the command cd backend && python server.py.
- Launch the Frontend: In a separate terminal, navigate to interface and run yarn start to open the React frontend.
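Before running the install script, it can help to confirm that both prerequisites are on the PATH. The following is a minimal sketch of such a check; the helper name missing_tools is hypothetical and not part of Whisper Playground, and it assumes the tools are invoked as conda and yarn.

```python
import shutil

# Tools named in the prerequisites step. The check itself is an
# illustrative helper, not something shipped with Whisper Playground.
REQUIRED_TOOLS = ("conda", "yarn")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("Conda and Yarn found; continue with: sh install_playground.sh")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.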
Pyannote Model Access
Whisper Playground uses pyannote.audio models hosted on the Hugging Face Hub. To use these models, you need a Hugging Face account and must accept their terms of use:
- Accept the terms for the pyannote/segmentation, pyannote/embedding, and pyannote/speaker-diarization models.
- Install the Hugging Face CLI tool and log in using your user access token, which is found under the Settings -> Access Tokens section in your Hugging Face account.
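As an alternative to the interactive CLI login, the same token can be supplied from Python. This is a sketch under assumptions: it reads the token from the HF_TOKEN environment variable, which is a choice made for this example rather than something the project prescribes, and the helper names resolve_token and log_in are hypothetical. The login function comes from the huggingface_hub package, which must be installed separately.

```python
import os


def resolve_token(env=os.environ):
    """Return the access token from HF_TOKEN, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def log_in(token):
    """Log in to the Hugging Face Hub (requires the huggingface_hub package)."""
    # Imported lazily so resolve_token() works even without the package.
    from huggingface_hub import login

    login(token=token)


if __name__ == "__main__":
    token = resolve_token()
    if token is None:
        print("Set HF_TOKEN to a token from Settings -> Access Tokens first.")
    else:
        log_in(token)
```

Logging in does not replace accepting the model terms; both steps are needed before the pyannote models can be downloaded.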
Key Parameters
Whisper Playground offers customization through various parameters:
- Model Size: Select a model size, from tiny to large-v2, depending on your accuracy and hardware needs.
- Language: Choose the language for transcription.
- Transcription Timeout: Define the waiting time before transcribing audio data.
- Beam Size: Set the number of candidate transcriptions explored during decoding; larger values can improve precision at the cost of speed.
- Transcription Method: Select either 'real-time' for immediate transcription or 'sequential' for transcription with contextual pauses.
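The parameters above can be pictured as a small validated settings object. This is an illustrative sketch only: the class TranscriptionSettings, its field names, and its default values are hypothetical and do not mirror the project's config.py. The list of intermediate model sizes is an assumption based on the commonly published Whisper sizes, and the timeout is assumed to be in seconds.

```python
from dataclasses import dataclass

# Assumed size list: the text only says "from tiny to large-v2"; the
# intermediate names follow the commonly published Whisper model sizes.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v1", "large-v2")
METHODS = ("real-time", "sequential")


@dataclass(frozen=True)
class TranscriptionSettings:
    """Hypothetical container for the parameters listed above."""

    model_size: str = "small"
    language: str = "en"
    transcription_timeout: float = 5.0  # assumed to be seconds
    beam_size: int = 5
    method: str = "real-time"

    def __post_init__(self):
        # Reject values outside the documented choices early.
        if self.model_size not in MODEL_SIZES:
            raise ValueError(f"unknown model size: {self.model_size!r}")
        if self.method not in METHODS:
            raise ValueError(f"unknown transcription method: {self.method!r}")
        if self.beam_size < 1:
            raise ValueError("beam_size must be at least 1")
        if self.transcription_timeout <= 0:
            raise ValueError("transcription_timeout must be positive")


if __name__ == "__main__":
    print(TranscriptionSettings(model_size="large-v2", method="sequential"))
```

Validating at construction time means a typo in a model name or method fails immediately rather than after the backend has started.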
Troubleshooting
On macOS, building the wheel for safetensors may fail. Installing Rust with brew install rust may resolve this.
Known Bugs
Users might encounter the following known issues:
- In sequential mode, there might be uncontrolled speaker swapping.
- In real-time mode, audio not reaching the transcription timeout might not be transcribed.
Reports of problems with languages that have not yet been tested are welcome as issue reports.
Licensing
Whisper Playground and its underlying code, along with the Whisper model weights, are released under the MIT License, promoting open-source collaboration and innovation.