Speaker Diarization Using OpenAI Whisper
Overview
The Whisper-diarization project is an advanced pipeline that combines OpenAI's Whisper with additional tools for speaker diarization. It aims to accurately identify who is speaking, and when, in an audio file by pairing Whisper's Automatic Speech Recognition (ASR) with Voice Activity Detection (VAD) and speaker embedding models.
How It Works
The process begins by enhancing the clarity of each speaker's voice extracted from the audio input, which improves the precision of the speaker embeddings. The audio is then transcribed with OpenAI's Whisper, and the transcription is time-aligned using ctc-forced-aligner to reduce errors caused by timing discrepancies.
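As a rough illustration of the transcription step only, the snippet below transcribes a file with the openai-whisper package. The file name and model size are placeholders, and the project's pipeline wraps Whisper with its own batching and alignment logic, so treat this as a minimal sketch rather than the project's actual code.

import whisper

# Minimal sketch of the transcription step (not the project's own wrapper).
# "audio.wav" and the model size are placeholders.
model = whisper.load_model("medium.en")
result = model.transcribe("audio.wav")

# Each segment carries rough start/end times that are later refined by forced alignment.
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f} - {segment["end"]:.2f}] {segment["text"]}')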
Once the timestamps are finely adjusted, the audio goes through MarbleNet, which performs VAD and segments the audio into individual speech parts, effectively eliminating silences. TitaNet technology is then deployed to extract speaker embeddings for each segment to determine the speaker's identity.
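As a hedged sketch of the speaker-embedding step, the snippet below loads a pretrained TitaNet checkpoint through NVIDIA NeMo and extracts an embedding for a single speech segment. It assumes NeMo's EncDecSpeakerLabelModel API and the published titanet_large checkpoint; the segment path is a placeholder, and the actual pipeline drives MarbleNet VAD and TitaNet through NeMo's diarization configuration rather than calling the model directly like this.

from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Illustrative only: load a pretrained TitaNet speaker-embedding model.
# "segment.wav" stands in for one VAD-extracted speech segment.
speaker_model = EncDecSpeakerLabelModel.from_pretrained("titanet_large")
embedding = speaker_model.get_embedding("segment.wav")  # fixed-size speaker embedding
print(embedding.shape)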
The timestamps generated by ctc-forced-aligner help match each spoken word to a specific speaker. Adjustments are made with punctuation models to counteract slight time drifts, culminating in a comprehensive diarization where each word is accurately linked to a speaker.
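Conceptually, this mapping is an interval lookup: each word's midpoint timestamp is matched against the speaker-labelled segments produced by the diarizer. The helper below is a hypothetical illustration of that idea, not the project's actual implementation.

def assign_words_to_speakers(words, speaker_segments):
    """Illustrative helper: attach a speaker label to each word by timestamp.

    words: list of dicts like {"word": "hello", "start": 1.2, "end": 1.5}
    speaker_segments: list of dicts like {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.3}
    """
    labelled = []
    for w in words:
        midpoint = (w["start"] + w["end"]) / 2
        speaker = "UNKNOWN"
        for seg in speaker_segments:
            if seg["start"] <= midpoint <= seg["end"]:
                speaker = seg["speaker"]
                break
        labelled.append({**w, "speaker": speaker})
    return labelled

# Example: the word at 1.2-1.5 s falls inside SPEAKER_00's 0.0-4.3 s segment.
words = [{"word": "hello", "start": 1.2, "end": 1.5}]
segments = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 4.3}]
print(assign_words_to_speakers(words, segments))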
Installation Requirements
To set up this project, users need Python version 3.10 or above. Although Python 3.9 is functional, dependencies must then be installed manually. Essential prerequisites include FFMPEG for processing audio files and Cython for compiling the necessary components.
Installation Steps:
- Install Cython:
pip install cython
Or for Debian-based systems:
sudo apt update && sudo apt install cython3
- Install FFMPEG according to your OS:
- Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
- Arch Linux:
sudo pacman -S ffmpeg
- MacOS:
brew install ffmpeg
- Windows:
Using Chocolatey:
choco install ffmpeg
Using Scoop:
scoop install ffmpeg
Using WinGet:
winget install ffmpeg
- Install additional requirements:
pip install -c constraints.txt -r requirements.txt
Usage Instructions
To use the diarization tool, execute the following command:
python diarize.py -a AUDIO_FILE_NAME
For systems equipped with 10 GB of VRAM or more, diarize_parallel.py can be used to run NeMo concurrently with Whisper, optimizing the workflow in certain scenarios. Note that this functionality is experimental.
Command Line Options
- -a AUDIO_FILE_NAME: Specifies the audio file for processing.
- --no-stem: Turns off source separation.
- --whisper-model: Selects the ASR model, with medium.en being the default.
- --suppress_numerals: Converts numbers to words to enhance alignment.
- --device: Chooses the processing device, defaulting to "cuda" if available.
- --language: Sets the spoken language manually when auto-detection fails.
- --batch-size: Adjusts the batch size for batched inference, useful for managing memory usage.
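For example, to diarize an English recording on a GPU with a larger Whisper model while converting numerals to words, a command along these lines could be used (the audio file name and model choice are placeholders):
python diarize.py -a meeting.wav --whisper-model large-v2 --language en --device cuda --suppress_numerals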
Limitations and Future Directions
Currently, the system struggles with overlapping speech. Future enhancements might involve separating audio to isolate individual speakers first, a step that requires significant computational resources. Additionally, there are plans to implement a feature for limiting sentence lengths in subtitles.
Acknowledgements
The project thanks @adamjonas for his support. This work builds on technologies such as OpenAI's Whisper, NVIDIA NeMo, and Facebook's Demucs.
For those using the project in their research, citation details are provided:
@unpublished{hassouna2024whisperdiarization,
title={Whisper Diarization: Speaker Diarization Using OpenAI Whisper},
author={Ashraf, Mahmoud},
year={2024}
}