Introduction to Whisper-Timestamped
Whisper-Timestamped is an advanced tool designed for multilingual automatic speech recognition, equipped with features that provide word-level timestamps and confidence measures. This project, an extension of the Whisper models developed by OpenAI, enhances speech recognition tasks by producing more accurate and detailed transcriptions.
Key Features and Innovations
- Word-Level Timestamps: Whisper-Timestamped produces precise timestamps for each word spoken in an audio segment. This is crucial for applications that require exact timing, such as subtitle generation or detailed audio analysis.
- Confidence Scoring: Each word and segment is assigned a confidence score, allowing users to assess the reliability of the transcription.
- Efficient and Memory-Conscious: The implementation can handle long audio files with little additional memory, maintaining efficiency even during intensive tasks.
- Voice Activity Detection (VAD): Before the Whisper model is applied, voice activity detection can filter out non-speech segments, ensuring cleaner transcription. Available VAD methods include silero and auditok, among others.
- Language Versatility: If no language is specified in the input, the tool can include language probabilities among its outputs, making it well suited to multilingual environments.
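To illustrate what a VAD front end does before transcription, here is a toy energy-threshold detector. This is only a conceptual sketch: the silero and auditok methods the tool actually supports use trained models or adaptive thresholds rather than a fixed RMS cutoff.

```python
def simple_vad(samples, frame_size=160, threshold=0.02):
    """Toy energy-based voice activity detector.

    Returns one boolean per frame: True where the frame's RMS energy
    exceeds the threshold (treated as speech), False otherwise.
    """
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = (sum(s * s for s in frame) / frame_size) ** 0.5
        flags.append(rms > threshold)
    return flags

# 160 samples of silence followed by 160 samples of a loud signal
flags = simple_vad([0.0] * 160 + [0.5] * 160)
print(flags)  # only the second frame is flagged as speech
```

A real pipeline would keep only the frames (or contiguous runs of frames) flagged as speech and pass those to the Whisper model.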
Comparison to Other Methods
Whisper-Timestamped offers several advantages over alternative approaches that use wav2vec models. Unlike wav2vec, which requires separate models for each language, Whisper-Timestamped leverages the multilingual capabilities of Whisper models. This approach avoids cumbersome character normalization tasks and is more robust when handling speech disfluencies.
Alternative methods, such as using timestamp tokens for alignment, are less reliable. Whisper models traditionally predict timestamps only after a number of words, leading to accuracy issues. Whisper-Timestamped circumvents these problems with a method grounded in Dynamic Time Warping applied to cross-attention weights.
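To make the alignment idea concrete, here is a minimal, generic Dynamic Time Warping routine over a cost matrix. In whisper-timestamped the costs are derived from cross-attention weights between text tokens and audio frames; this sketch only shows the dynamic-programming core on a made-up matrix.

```python
def dtw(cost):
    """Dynamic Time Warping over a cost matrix.

    cost[i][j] is the cost of aligning token i with frame j.
    Returns (total alignment cost, monotonic alignment path).
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    # acc[i][j] = minimal cumulative cost to align the first i tokens
    # with the first j frames
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end, preferring the diagonal move on ties
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {(i - 1, j - 1): acc[i - 1][j - 1],
                 (i - 1, j): acc[i - 1][j],
                 (i, j - 1): acc[i][j - 1]}
        i, j = min(steps, key=steps.get)
    path.reverse()
    return acc[n][m], path

total, path = dtw([[0, 1, 1],
                   [1, 0, 1],
                   [1, 1, 0]])
print(total, path)  # 0.0 [(0, 0), (1, 1), (2, 2)]
```

Because the path is monotonic in both dimensions, every token is mapped to a contiguous span of audio frames, which is what yields per-word start and end times.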
How to Install
For Standard Installation:
- Ensure you have python3 (version 3.7 or higher) and ffmpeg installed.
- Use pip to install:
  pip3 install whisper-timestamped
- Alternatively, clone the repository and install:
  git clone https://github.com/linto-ai/whisper-timestamped
  cd whisper-timestamped/
  python3 setup.py install
For CPU-Only Installation:
- If opting out of GPU support, first install a CPU-only build of PyTorch:
  pip3 install \
      torch==1.13.1+cpu \
      torchaudio==0.13.1+cpu \
      -f https://download.pytorch.org/whl/torch_stable.html
- Then install Whisper-Timestamped as described above for the standard installation.
Usage Guide
Python Usage:
- Import and use Whisper-Timestamped as a drop-in replacement for Whisper. It provides additional options for word alignment and confidence scoring.
Example:
  import whisper_timestamped as whisper

  audio = whisper.load_audio("AUDIO.wav")
  model = whisper.load_model("tiny", device="cpu")
  result = whisper.transcribe(model, audio, language="fr")

  import json
  print(json.dumps(result, indent=2, ensure_ascii=False))
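The returned dictionary nests words inside segments, each with start/end times and a confidence score. Here is a sketch of walking that structure; the key names match the project's documented JSON output, but the sample values below are invented for illustration.

```python
# Hypothetical sample mirroring the shape of a transcribe() result
result = {
    "text": " Bonjour à tous",
    "segments": [
        {
            "start": 0.0, "end": 1.2, "confidence": 0.93,
            "words": [
                {"text": "Bonjour", "start": 0.0, "end": 0.6, "confidence": 0.95},
                {"text": "à",       "start": 0.6, "end": 0.8, "confidence": 0.88},
                {"text": "tous",    "start": 0.8, "end": 1.2, "confidence": 0.96},
            ],
        }
    ],
}

# Flatten all words across segments, and flag the low-confidence ones
words = [w for seg in result["segments"] for w in seg["words"]]
low_conf = [w["text"] for w in words if w["confidence"] < 0.9]

for w in words:
    print(f'{w["text"]:>10} {w["start"]:5.2f}-{w["end"]:5.2f}  conf={w["confidence"]:.2f}')
print("low confidence:", low_conf)
```

A post-processing step like this is useful for highlighting words that may need manual review in subtitle workflows.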
Command Line Usage:
- The command-line interface mirrors Whisper, with additional options for output formats (including CSV), word confidence, and verbosity.
Quick start:
  whisper_timestamped audio1.flac audio2.mp3 --model tiny --output_dir .
- Options include adjustments for confidence computation, punctuation handling, and language detection.
Conclusion
Whisper-Timestamped stands out as a highly efficient tool for speech recognition tasks, offering precise word timings and robust confidence scoring. It extends the capabilities of Whisper, making it adaptable for various languages while minimizing additional resource requirements.
For anyone tackling an automatic speech recognition or transcription project, Whisper-Timestamped provides a real edge through its approach to word timestamps and confidence scoring across diverse audio content.