Introduction to Whisper-Timestamped
Whisper-Timestamped is an advanced tool designed for multilingual automatic speech recognition, equipped with features that provide word-level timestamps and confidence measures. This project, an extension of the Whisper models developed by OpenAI, enhances speech recognition tasks by producing more accurate and detailed transcriptions.
Key Features and Innovations
- Word-Level Timestamps: Whisper-Timestamped produces precise timestamps for each word spoken in an audio segment. This is crucial for applications that require exact timing, such as subtitle generation or detailed audio analysis.
- Confidence Scoring: Each word and segment is assigned a confidence score, allowing users to assess the reliability of the transcription.
- Efficient and Memory-Conscious: The implementation can handle long audio files with little additional memory, maintaining efficiency even during intensive tasks.
- Voice Activity Detection (VAD): Before the Whisper model is applied, voice activity detection can filter out non-speech segments, ensuring cleaner transcription. Available VAD methods include silero and auditok, among others.
- Language Versatility: If no language is specified in the input, the tool can include language probabilities among its outputs, making it well suited to multilingual environments.
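To illustrate what a VAD front end does before transcription, here is a toy energy-threshold detector. This is only a conceptual sketch: the silero and auditok methods the tool actually supports use trained models or adaptive thresholds rather than a fixed RMS cutoff.

```python
def simple_vad(samples, frame_size=160, threshold=0.02):
    """Toy energy-based voice activity detector.

    Returns one boolean per frame: True where the frame's RMS energy
    exceeds the threshold (treated as speech), False otherwise.
    """
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = (sum(s * s for s in frame) / frame_size) ** 0.5
        flags.append(rms > threshold)
    return flags

# 160 samples of silence followed by 160 samples of a loud signal
flags = simple_vad([0.0] * 160 + [0.5] * 160)
print(flags)  # only the second frame is flagged as speech
```

A real pipeline would keep only the frames (or contiguous runs of frames) flagged as speech and pass those to the Whisper model.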
Comparison to Other Methods
Whisper-Timestamped offers several advantages over alternative approaches that use wav2vec models. Unlike wav2vec, which requires separate models for each language, Whisper-Timestamped leverages the multilingual capabilities of Whisper models. This approach avoids cumbersome character normalization tasks and is more robust when handling speech disfluencies.
Alternative methods, such as using timestamp tokens for alignment, are less reliable. Whisper models traditionally predict timestamps only after a number of words, leading to accuracy issues. Whisper-Timestamped circumvents these problems with a method grounded in Dynamic Time Warping applied to cross-attention weights.
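To make the alignment idea concrete, here is a minimal, generic Dynamic Time Warping routine over a cost matrix. In whisper-timestamped the costs are derived from cross-attention weights between text tokens and audio frames; this sketch only shows the dynamic-programming core on a made-up matrix.

```python
def dtw(cost):
    """Dynamic Time Warping over a cost matrix.

    cost[i][j] is the cost of aligning token i with frame j.
    Returns (total alignment cost, monotonic alignment path).
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    # acc[i][j] = minimal cumulative cost to align the first i tokens
    # with the first j frames
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end, preferring the diagonal move on ties
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {(i - 1, j - 1): acc[i - 1][j - 1],
                 (i - 1, j): acc[i - 1][j],
                 (i, j - 1): acc[i][j - 1]}
        i, j = min(steps, key=steps.get)
    path.reverse()
    return acc[n][m], path

total, path = dtw([[0, 1, 1],
                   [1, 0, 1],
                   [1, 1, 0]])
print(total, path)  # 0.0 [(0, 0), (1, 1), (2, 2)]
```

Because the path is monotonic in both dimensions, every token is mapped to a contiguous span of audio frames, which is what yields per-word start and end times.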
How to Install
For Standard Installation:
- Ensure you have python3 (version 3.7 or higher) and ffmpeg installed.
- Use pip to install:
  pip3 install whisper-timestamped
- Alternatively, clone the repository and install:
  git clone https://github.com/linto-ai/whisper-timestamped
  cd whisper-timestamped/
  python3 setup.py install
For CPU-Only Installation:
- If opting out of GPU support, first install a CPU-only build of PyTorch:
  pip3 install \
      torch==1.13.1+cpu \
      torchaudio==0.13.1+cpu \
      -f https://download.pytorch.org/whl/torch_stable.html
- Then install Whisper-Timestamped as described above for the standard installation.
Usage Guide
Python Usage:
- Import and use Whisper-Timestamped as a drop-in replacement for Whisper. It provides additional options for word alignment and confidence scoring.
Example:
  import whisper_timestamped as whisper

  audio = whisper.load_audio("AUDIO.wav")
  model = whisper.load_model("tiny", device="cpu")
  result = whisper.transcribe(model, audio, language="fr")

  import json
  print(json.dumps(result, indent=2, ensure_ascii=False))
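The returned dictionary nests words inside segments, each with start/end times and a confidence score. Here is a sketch of walking that structure; the key names match the project's documented JSON output, but the sample values below are invented for illustration.

```python
# Hypothetical sample mirroring the shape of a transcribe() result
result = {
    "text": " Bonjour à tous",
    "segments": [
        {
            "start": 0.0, "end": 1.2, "confidence": 0.93,
            "words": [
                {"text": "Bonjour", "start": 0.0, "end": 0.6, "confidence": 0.95},
                {"text": "à",       "start": 0.6, "end": 0.8, "confidence": 0.88},
                {"text": "tous",    "start": 0.8, "end": 1.2, "confidence": 0.96},
            ],
        }
    ],
}

# Flatten all words across segments, and flag the low-confidence ones
words = [w for seg in result["segments"] for w in seg["words"]]
low_conf = [w["text"] for w in words if w["confidence"] < 0.9]

for w in words:
    print(f'{w["text"]:>10} {w["start"]:5.2f}-{w["end"]:5.2f}  conf={w["confidence"]:.2f}')
print("low confidence:", low_conf)
```

A post-processing step like this is useful for highlighting words that may need manual review in subtitle workflows.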
Command Line Usage:
- The command-line interface mirrors Whisper, with additional options for output formats (including CSV), word confidence, and verbosity.
Quick start:
  whisper_timestamped audio1.flac audio2.mp3 --model tiny --output_dir .
- Options include adjustments for confidence computation, punctuation handling, and language detection.
Conclusion
Whisper-Timestamped stands out as a highly efficient tool for speech recognition tasks, offering precise word timings and robust confidence scoring. It extends the capabilities of Whisper, making it adaptable for various languages while minimizing additional resource requirements.
For anyone tackling an automatic speech recognition or transcription project, Whisper-Timestamped provides a real edge through its approach to word timestamps and confidence scoring across diverse audio content.