Stabilizing Timestamps for Whisper
Stabilizing Timestamps for Whisper (Stable-ts) is a library that builds on Whisper to produce more reliable timestamps and to extend its functionality. It is aimed at users who need accurate timestamps alongside their speech-to-text output.
Setup
Stable-ts requires FFmpeg and PyTorch:
- FFmpeg: This tool is essential for handling multimedia data. Installation varies based on your operating system:
- Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
- Arch Linux:
sudo pacman -S ffmpeg
- MacOS with Homebrew:
brew install ffmpeg
- Windows with Chocolatey:
choco install ffmpeg
- Windows with Scoop:
scoop install ffmpeg
- PyTorch: If you need GPU support, install PyTorch by following the official PyTorch installation instructions.
Then, you can install the Stable-ts package using:
pip install -U stable-ts
To install the latest commit instead:
pip install -U git+https://github.com/jianfch/stable-ts.git
If you prefer a version without Whisper as a dependency, use:
pip install -U stable-ts-whisperless
To install the latest Whisperless commit:
pip install -U git+https://github.com/jianfch/stable-ts.git@whisperless
Usage
Stable-ts provides a Python API and a command-line interface for transcribing audio files with improved timestamp accuracy.
Transcribe
Load a Whisper model, transcribe an audio file, and save the result as subtitles:
import stable_whisper
model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3')
result.to_srt_vtt('audio.srt')
You can also use the command-line interface (CLI):
stable-ts audio.mp3 -o audio.srt
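To make the subtitle output more concrete, here is a minimal sketch of how timestamped segments map onto the SRT format. It is not the implementation behind `to_srt_vtt`; the helper names `srt_timestamp` and `to_srt` are hypothetical, and segments are assumed to be plain `(start, end, text)` tuples with times in seconds.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT cues.

    Hypothetical helper for illustration only; not part of stable-ts.
    """
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, " Hello world. "), (1.5, 3.0, "Second cue.")]))
```

Each cue consists of an index, a time range, and the text, so the accuracy of the start and end times directly determines how well subtitles line up with the audio.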
Key Features
- Word-level Timestamps: Produces a start and end timestamp for each word.
- Adjustable Timestamps: Uses silence suppression and voice activity detection (VAD) to adjust timestamps and improve their accuracy.
- Custom Regrouping: Regroups words into segments based on punctuation and gaps in speech.
- Additional Processing: Supports denoising and voice isolation to improve transcription quality.
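The regrouping idea from the list above can be sketched in a few lines. This is an illustration of the general technique, not the algorithm Stable-ts uses: the `Word` class, the `regroup` function, and the `max_gap` and `end_punctuation` parameters are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def regroup(words: List[Word], max_gap: float = 0.5,
            end_punctuation: str = ".?!") -> List[List[Word]]:
    """Split a flat word list into segments (hypothetical helper).

    A new segment starts when the pause before a word exceeds `max_gap`
    seconds, or after a word that ends with sentence-ending punctuation.
    """
    segments: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Break on a long pause between the previous word and this one.
        if current and word.start - current[-1].end > max_gap:
            segments.append(current)
            current = []
        current.append(word)
        # Break after sentence-ending punctuation.
        if word.text.rstrip().endswith(tuple(end_punctuation)):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    words = [
        Word("Hello", 0.0, 0.4), Word("there.", 0.5, 0.9),
        Word("How", 1.0, 1.2), Word("are", 2.0, 2.2), Word("you?", 2.3, 2.6),
    ]
    for segment in regroup(words):
        print(segment[0].start, segment[-1].end, " ".join(w.text for w in segment))
```

Because the split decisions depend on word boundaries, this kind of regrouping is only as good as the word-level timestamps it is given.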
Advanced Capabilities
- Faster-Whisper Support: Compatible with the faster-whisper variant for faster transcription.
- Silence Suppression: Automatically adjusts timestamps considering detected silence within the audio.
- Regrouping and Alignment: Fine-tunes word grouping for better segment accuracy.
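As a rough illustration of silence suppression, the sketch below trims a word's boundaries so they do not fall inside detected silent intervals. It assumes the silent intervals are already known as `(start, end)` pairs in seconds; the function name `suppress_silence` and its behavior are assumptions for this example and may differ from what Stable-ts does internally.

```python
from typing import Iterable, Tuple


def suppress_silence(start: float, end: float,
                     silences: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Move word boundaries out of silent intervals (hypothetical helper).

    If the word starts inside a silent interval, its start is pushed to the
    end of that interval. If it ends inside one, its end is pulled back to
    the start of that interval. A word that lies entirely inside a silent
    interval is left unchanged rather than collapsed to zero length.
    """
    for s_start, s_end in silences:
        if s_start <= start and end <= s_end:
            continue  # fully covered: leave the word as it is
        if s_start <= start < s_end:
            start = s_end
        elif s_start < end <= s_end:
            end = s_start
    return start, end


if __name__ == "__main__":
    # The word nominally spans 1.0-2.0 s, but 0.8-1.3 s is silent.
    print(suppress_silence(1.0, 2.0, [(0.8, 1.3)]))
```

In practice the silent intervals would come from an energy threshold or a VAD model, and the quality of that detection limits how much the adjustment helps.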
Tailored for Advanced Users
Stable-ts also supports more advanced use cases, such as:
- Dynamic Quantization: Optimize model performance with quantization.
- Custom Callbacks: Allow deeper customization of transcription behavior.
- Preprocessing Options: Choose from a variety of denoising and filtering settings to suit different audio environments.
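The callback pattern mentioned above can be sketched as follows. The example assumes a progress callback that receives the number of seconds processed so far and the total duration; that signature is an assumption here and should be checked against the Stable-ts documentation before use. `run_with_progress` is a hypothetical stand-in for a transcription loop so that the example runs without a model.

```python
from typing import Callable, List


def make_progress_logger(log: List[str]) -> Callable[[float, float], None]:
    """Build a callback that records progress as a percentage string."""
    def on_progress(seconds_done: float, total_seconds: float) -> None:
        percent = 100.0 * seconds_done / total_seconds if total_seconds else 0.0
        log.append(f"{percent:.0f}%")
    return on_progress


def run_with_progress(total_seconds: float, chunk_seconds: float,
                      callback: Callable[[float, float], None]) -> None:
    """Hypothetical stand-in for a transcription loop.

    Processes the audio in fixed-size chunks and reports progress after
    each chunk. A real transcription call would do the work here.
    """
    done = 0.0
    while done < total_seconds:
        done = min(done + chunk_seconds, total_seconds)
        callback(done, total_seconds)


if __name__ == "__main__":
    log: List[str] = []
    run_with_progress(90.0, 30.0, make_progress_logger(log))
    print(log)
```

Keeping the callback a plain function of two numbers makes it easy to swap a logger for a progress bar or a UI update without touching the transcription code.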
In summary, Stable-ts builds on Whisper to provide more stable timestamps and more flexible output. This makes it useful for developers working with Automatic Speech Recognition (ASR), from simple transcription to more involved audio analysis.