Speaker Diarization Using OpenAI Whisper
Overview
The Whisper-diarization project is an advanced pipeline that combines OpenAI's Whisper with additional tools for speaker diarization. It aims to accurately identify who is speaking, and when, in an audio file by pairing Whisper's Automatic Speech Recognition (ASR) with Voice Activity Detection (VAD) and speaker embedding models.
How It Works
The process begins by enhancing the clarity of each speaker's voice extracted from the audio input, which improves the precision of the speaker embeddings. The audio is then transcribed with OpenAI's Whisper, and the transcription is time-aligned using ctc-forced-aligner to reduce errors caused by timing discrepancies.
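As a rough illustration of the transcription step only, the snippet below transcribes a file with the openai-whisper package. The file name and model size are placeholders, and the project's pipeline wraps Whisper with its own batching and alignment logic, so treat this as a minimal sketch rather than the project's actual code.

import whisper

# Minimal sketch of the transcription step (not the project's own wrapper).
# "audio.wav" and the model size are placeholders.
model = whisper.load_model("medium.en")
result = model.transcribe("audio.wav")

# Each segment carries rough start/end times that are later refined by forced alignment.
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f} - {segment["end"]:.2f}] {segment["text"]}')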
Once the timestamps are finely adjusted, the audio goes through MarbleNet, which performs VAD and segments the audio into individual speech parts, effectively eliminating silences. TitaNet technology is then deployed to extract speaker embeddings for each segment to determine the speaker's identity.
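As a hedged sketch of the speaker-embedding step, the snippet below loads a pretrained TitaNet checkpoint through NVIDIA NeMo and extracts an embedding for a single speech segment. It assumes NeMo's EncDecSpeakerLabelModel API and the published titanet_large checkpoint; the segment path is a placeholder, and the actual pipeline drives MarbleNet VAD and TitaNet through NeMo's diarization configuration rather than calling the model directly like this.

from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Illustrative only: load a pretrained TitaNet speaker-embedding model.
# "segment.wav" stands in for one VAD-extracted speech segment.
speaker_model = EncDecSpeakerLabelModel.from_pretrained("titanet_large")
embedding = speaker_model.get_embedding("segment.wav")  # fixed-size speaker embedding
print(embedding.shape)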
The timestamps generated by ctc-forced-aligner help match each spoken word to a specific speaker. Adjustments are made with punctuation models to counteract slight time drifts, culminating in a comprehensive diarization where each word is accurately linked to a speaker.
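Conceptually, this mapping is an interval lookup: each word's midpoint timestamp is matched against the speaker-labelled segments produced by the diarizer. The helper below is a hypothetical illustration of that idea, not the project's actual implementation.

def assign_words_to_speakers(words, speaker_segments):
    """Illustrative helper: attach a speaker label to each word by timestamp.

    words: list of dicts like {"word": "hello", "start": 1.2, "end": 1.5}
    speaker_segments: list of dicts like {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.3}
    """
    labelled = []
    for w in words:
        midpoint = (w["start"] + w["end"]) / 2
        speaker = "UNKNOWN"
        for seg in speaker_segments:
            if seg["start"] <= midpoint <= seg["end"]:
                speaker = seg["speaker"]
                break
        labelled.append({**w, "speaker": speaker})
    return labelled

# Example: the word at 1.2-1.5 s falls inside SPEAKER_00's 0.0-4.3 s segment.
words = [{"word": "hello", "start": 1.2, "end": 1.5}]
segments = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 4.3}]
print(assign_words_to_speakers(words, segments))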
Installation Requirements
To set up this project, users need Python version 3.10 or above. Although Python 3.9 is functional, dependencies must then be installed manually. Essential prerequisites include FFMPEG for processing audio files and Cython for compiling the necessary components.
Installation Steps:
- Install Cython:
pip install cython
Or for Debian-based systems:
sudo apt update && sudo apt install cython3
- Install FFMPEG according to your OS:
- Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
- Arch Linux:
sudo pacman -S ffmpeg
- MacOS:
brew install ffmpeg
- Windows:
Using Chocolatey:
choco install ffmpeg
Using Scoop:
scoop install ffmpeg
Using WinGet:
winget install ffmpeg
- Install additional requirements:
pip install -c constraints.txt -r requirements.txt
Usage Instructions
To use the diarization tool, execute the following command:
python diarize.py -a AUDIO_FILE_NAME
For systems equipped with 10 GB of VRAM or more, diarize_parallel.py can be used to run NeMo concurrently with Whisper, optimizing the workflow in certain scenarios. Note that this functionality is experimental.
Command Line Options
- -a AUDIO_FILE_NAME: Specifies the audio file for processing.
- --no-stem: Turns off source separation.
- --whisper-model: Selects the ASR model, with medium.en being the default.
- --suppress_numerals: Converts numbers to words to enhance alignment.
- --device: Chooses the processing device, defaulting to "cuda" if available.
- --language: Sets the spoken language manually when auto-detection fails.
- --batch-size: Adjusts the batch size for batched inference, useful for managing memory usage.
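For example, to diarize an English recording on a GPU with a larger Whisper model while converting numerals to words, a command along these lines could be used (the audio file name and model choice are placeholders):
python diarize.py -a meeting.wav --whisper-model large-v2 --language en --device cuda --suppress_numerals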
Limitations and Future Directions
Currently, the system struggles with overlapping speech. Future enhancements might involve separating audio to isolate individual speakers first, a step that requires significant computational resources. Additionally, there are plans to implement a feature for limiting sentence lengths in subtitles.
Acknowledgements
The project thanks @adamjonas for his support. This work builds on technologies such as OpenAI's Whisper, NVIDIA NeMo, and Facebook's Demucs.
For those using the project in their research, citation details are provided:
@unpublished{hassouna2024whisperdiarization,
title={Whisper Diarization: Speaker Diarization Using OpenAI Whisper},
author={Ashraf, Mahmoud},
year={2024}
}