stable-ts - More Reliable Timestamps for Whisper Transcriptions

Stabilizing Timestamps for Whisper

Stabilizing Timestamps for Whisper (stable-ts) is a library that builds on Whisper to produce more reliable timestamps and extends its functionality. It is most useful when you need speech-to-text output with precise timing for your audio data.

Setup

Stable-ts requires FFmpeg and PyTorch:

FFmpeg: This tool is essential for handling multimedia data. Installation varies based on your operating system:
- Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg
- Arch Linux: sudo pacman -S ffmpeg
- macOS with Homebrew: brew install ffmpeg
- Windows with Chocolatey: choco install ffmpeg
- Windows with Scoop: scoop install ffmpeg
PyTorch: If you need GPU support, install PyTorch by following the installation instructions on the PyTorch website.

Then, you can install the Stable-ts package using:

pip install -U stable-ts

For the latest commit:

pip install -U git+https://github.com/jianfch/stable-ts.git

If you prefer a version without Whisper as a dependency, use:

pip install -U stable-ts-whisperless

For the latest Whisperless commit:

pip install -U git+https://github.com/jianfch/stable-ts.git@whisperless

Usage

Stable-ts can be used from Python or from the command line to transcribe audio files into text with more accurate timestamps.

Transcribe

Load a Whisper model, transcribe an audio file, and save the result as subtitles:

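import stable_whisper

# Load the 'base' Whisper model
model = stable_whisper.load_model('base')
# Transcribe the audio file
result = model.transcribe('audio.mp3')
# Save the result as an SRT subtitle file
result.to_srt_vtt('audio.srt')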

The same transcription can be run from the command line interface (CLI):

stable-ts audio.mp3 -o audio.srt
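
To make the output format concrete, the following is a minimal sketch of how segment timestamps in seconds map onto SRT cues. It is an illustration only, not the library's implementation: the helper names srt_timestamp and to_srt and the dictionary layout of a segment are assumptions made for this example.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render segments (dicts with 'start', 'end', 'text') as SRT cues.

    The dict layout is a hypothetical stand-in for a transcription result.
    """
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([
        {"start": 0.0, "end": 1.25, "text": " Hello there."},
        {"start": 1.5, "end": 3.0, "text": " How are you?"},
    ]))
```

Because every cue carries its own start and end time, any error in the underlying timestamps shows up directly as mistimed subtitles.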

Key Features

Word-level Timestamps: Produces a start and end time for each word.
Adjustable Timestamps: Uses silence suppression and voice activity detection (VAD) to adjust timestamps and improve accuracy.
Custom Regrouping: Regroups words into segments based on punctuation and gaps in speech.
Additional Processing: Supports denoising and voice isolation to improve transcription quality.
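
The regrouping idea above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's algorithm: the Word class, the regroup function, and the 0.5 second gap threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def regroup(words, max_gap=0.5, end_marks=".?!"):
    """Split a flat word list into segments.

    A segment is closed after a word that ends with sentence punctuation,
    or when the pause before the next word exceeds max_gap seconds.
    """
    segments, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        is_last = i == len(words) - 1
        ends_sentence = word.text.rstrip().endswith(tuple(end_marks))
        long_pause = not is_last and words[i + 1].start - word.end > max_gap
        if is_last or ends_sentence or long_pause:
            segments.append(current)
            current = []
    return segments


if __name__ == "__main__":
    words = [
        Word("Hello", 0.0, 0.4), Word(" world.", 0.45, 0.9),
        Word(" How", 1.0, 1.2), Word(" are", 2.0, 2.2), Word(" you?", 2.25, 2.6),
    ]
    for segment in regroup(words):
        text = "".join(w.text for w in segment).strip()
        print(f"{segment[0].start:.2f} - {segment[-1].end:.2f}: {text}")
```

Each segment inherits its start from its first word and its end from its last word, which is why accurate word-level timestamps matter for regrouping.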

Advanced Capabilities

Faster-Whisper Support: Works with faster-whisper models for faster transcription.
Silence Suppression: Automatically adjusts timestamps considering detected silence within the audio.
Regrouping and Alignment: Fine-tunes word grouping for better segment accuracy.
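
As a rough illustration of silence suppression, the sketch below trims a word's timestamps so they do not start or end inside a detected silent interval. It assumes the silent intervals are already known as (start, end) pairs in seconds. The function name suppress_silence, its parameters, and the minimum duration guard are hypothetical and much simpler than what the library does.

```python
def suppress_silence(start, end, silences, min_duration=0.02):
    """Move word boundaries out of silent intervals.

    start, end   -- word timestamps in seconds
    silences     -- list of (silence_start, silence_end) pairs in seconds
    min_duration -- if trimming would leave less than this, keep the original
    """
    new_start, new_end = start, end
    for s_start, s_end in silences:
        # Word begins inside silence: push its start to where the silence ends.
        if s_start <= new_start < s_end:
            new_start = s_end
        # Word ends inside silence: pull its end back to where the silence begins.
        if s_start < new_end <= s_end:
            new_end = s_start
    if new_end - new_start < min_duration:
        return start, end
    return new_start, new_end


if __name__ == "__main__":
    print(suppress_silence(1.0, 2.0, [(0.5, 1.2), (1.8, 2.5)]))
```

The guard on min_duration keeps a word that lies entirely within a detected silence from collapsing to zero or negative length.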

Tailored for Advanced Users

Stable-ts also supports more advanced use cases, such as:

Dynamic Quantization: Reduce memory use and speed up CPU inference by quantizing the model.
Custom Callbacks: Allow deeper customization of transcription behavior.
Preprocessing Options: Choose from a variety of denoising and filtering settings to suit different audio environments.

In summary, Stable-ts builds on Whisper to provide more stable timestamps and greater flexibility for developers working with Automatic Speech Recognition (ASR). It covers both simple audio transcription and more involved audio analysis projects.