docker-whisperX - Optimize Docker Workflows with GPU-Accelerated Speech Recognition

docker-whisperX Project Introduction

docker-whisperX is a community-driven project that provides Docker images for WhisperX, an automatic speech recognition tool with word-level timestamps and speaker diarization. The repository focuses on running the continuous integration of these Docker builds efficiently on GitHub's free runners. Its workflow builds a large number of large images in parallel, and it keeps image size and build time down by optimizing Docker layer caching and the order in which layers are cached.

Setting Up Docker for GPU Support

Windows

To prepare Docker for GPU support on Windows, install Docker Desktop, the CUDA Toolkit, and the NVIDIA Windows driver. Docker must be running on the WSL2 backend. The official NVIDIA and Docker documentation covers the setup in more detail.

Linux, macOS

On Linux and macOS, first install an NVIDIA GPU driver if one is not already present. Then install the NVIDIA Container Toolkit by following NVIDIA's official guide; the toolkit is what enables GPU support in Docker.
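
Once the driver and toolkit are in place, a quick check is to run nvidia-smi inside a GPU-enabled container. The following is a minimal sketch rather than part of the project: it only assembles the command so it can be inspected before running. The CUDA image tag is an assumption; any CUDA-enabled image should serve the same purpose.

```python
from __future__ import annotations

import shlex

# Assumption: this CUDA base image tag is only an example; substitute any
# CUDA-enabled image available to you.
CUDA_IMAGE = "nvidia/cuda:12.4.1-base-ubuntu22.04"


def gpu_smoke_test_command(image: str) -> list[str]:
    """Return a `docker run` command that prints GPU info from a container."""
    # --gpus all asks Docker to expose every available GPU to the container.
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(gpu_smoke_test_command(CUDA_IMAGE)))
```

If the printed command lists your GPU when run in a terminal, Docker can see the device. An error at this stage usually points to the driver or toolkit installation rather than to WhisperX.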

Pre-built Docker Images

The docker-whisperX project also publishes pre-built Docker images, which are updated regularly to track the WhisperX code base. These images can be run with different parameters to transcribe audio into various output formats, such as SRT. Each image is built for a particular language and model name, and the build infrastructure supports producing images in many such configurations.
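
As an illustration of running a pre-built image, the sketch below assembles a docker run command that transcribes one audio file to SRT. Several details are assumptions and should be checked against the project's README: the image name and tag, the /app mount point, and the `--` separator placed before the WhisperX arguments. The `--output_format` option belongs to the WhisperX command line.

```python
from __future__ import annotations

import shlex
from pathlib import Path

# Assumption: image name and tag (model "base", language "en") are examples.
IMAGE = "ghcr.io/jim60105/whisperx:base-en"


def transcribe_command(
    image: str,
    audio_file: str,
    output_format: str = "srt",
    workdir: str = ".",
    mount_point: str = "/app",  # assumption: the image's working directory
) -> list[str]:
    """Return a `docker run` command that transcribes `audio_file`."""
    host_dir = Path(workdir).resolve()
    return [
        "docker", "run", "--rm", "--gpus", "all",
        # Mount the host directory so the container can read the audio
        # file and write the transcript next to it.
        "-v", f"{host_dir}:{mount_point}",
        image,
        "--",  # assumption: separates Docker options from WhisperX options
        "--output_format", output_format,
        audio_file,
    ]


if __name__ == "__main__":
    print(shlex.join(transcribe_command(IMAGE, "audio.mp3")))
```

Changing the tag selects a different model and language, and changing `output_format` selects a different transcript format.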

Whisper models such as *.en and large-v1 are excluded from the standard builds because they are used less often. Users who need these models can build the images themselves by following the build instructions in the project's repository.

Maintaining Download Cache

The Docker setup can preserve the model download cache, so alignment models are shared between containers when you work with different languages. To do this, mount a cache directory into the container and use a specific image tag.
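
The idea can be sketched as follows: every run mounts the same cache location, so a model downloaded for one language is still there for the next run. This is a sketch under stated assumptions, not the project's documented command. The image tag, the volume name `whisper_cache`, the in-container cache path `/.cache`, the /app mount point, and the `--` separator are all assumptions to verify against the README. The `--model` and `--language` options belong to the WhisperX command line.

```python
from __future__ import annotations

import shlex
from pathlib import Path

# Assumption: a tag for an image that ships without preloaded models.
IMAGE = "ghcr.io/jim60105/whisperx:no_model"


def cached_transcribe_command(
    image: str,
    audio_file: str,
    model: str,
    language: str,
    cache_volume: str = "whisper_cache",  # assumption: named Docker volume
    cache_dir: str = "/.cache",           # assumption: cache path in the image
    workdir: str = ".",
    mount_point: str = "/app",            # assumption: image working directory
) -> list[str]:
    """Return a `docker run` command that reuses a shared download cache."""
    host_dir = Path(workdir).resolve()
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{host_dir}:{mount_point}",
        # The same volume on every run keeps downloaded models around.
        "-v", f"{cache_volume}:{cache_dir}",
        image,
        "--",
        "--model", model,
        "--language", language,
        "--output_format", "srt",
        audio_file,
    ]


if __name__ == "__main__":
    for lang, audio in [("en", "english.mp3"), ("de", "german.mp3")]:
        print(shlex.join(cached_transcribe_command(IMAGE, audio, "tiny", lang)))
```

Both printed commands mount the same volume, which is what lets the second run reuse whatever the first one downloaded.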

Building the Docker Image

Building the Docker image requires cloning the GitHub repository with its submodules. Users can specify the language and model through build arguments. For those using older Docker clients, enabling BuildKit mode is necessary to use certain build performance features.
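
The build steps above can be sketched as a clone command, a build command, and an environment that turns BuildKit on. The repository URL, the build-argument names `WHISPER_MODEL` and `LANG`, and the resulting tag format are assumptions made for illustration; the project's README and Dockerfile are the authority. `DOCKER_BUILDKIT=1` is the standard environment variable for enabling BuildKit on older Docker clients.

```python
from __future__ import annotations

import os
import shlex

# Assumption: repository URL; confirm it against the project page.
REPO_URL = "https://github.com/jim60105/docker-whisperX.git"


def clone_command(repo_url: str = REPO_URL) -> list[str]:
    """Return a `git clone` command that also fetches submodules."""
    return ["git", "clone", "--recurse-submodules", repo_url]


def build_command(model: str, language: str, context: str = ".") -> list[str]:
    """Return a `docker build` command for one model/language pair."""
    return [
        "docker", "build",
        # Assumption: these build-argument names are illustrative.
        "--build-arg", f"WHISPER_MODEL={model}",
        "--build-arg", f"LANG={language}",
        "-t", f"whisperx:{model}-{language}",
        context,
    ]


def buildkit_env(base_env: dict[str, str] | None = None) -> dict[str, str]:
    """Return a copy of the environment with BuildKit enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOCKER_BUILDKIT"] = "1"
    return env


if __name__ == "__main__":
    print(shlex.join(clone_command()))
    print("DOCKER_BUILDKIT=1 " + shlex.join(build_command("base", "en")))
```

To execute the build from Python rather than print it, the command and environment can be passed to `subprocess.run(build_command("base", "en"), env=buildkit_env(), check=True)` from inside the cloned repository.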

Multiple Docker images can be built simultaneously using a Docker bake file, but caution is advised as this is an experimental feature and may require additional settings to function correctly.
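
Because bake is experimental, it can help to inspect what a bake invocation would build before starting it. The sketch below assembles a `docker buildx bake` command and defaults to `--print`, which shows the resolved build configuration without building anything. The target names are hypothetical; the real ones are defined in the project's bake file.

```python
from __future__ import annotations

import shlex


def bake_command(
    targets: list[str],
    bake_file: str | None = None,
    dry_run: bool = True,
) -> list[str]:
    """Return a `docker buildx bake` command for the given targets."""
    cmd = ["docker", "buildx", "bake"]
    if bake_file is not None:
        # Point bake at a specific definition file instead of the default.
        cmd += ["--file", bake_file]
    if dry_run:
        # --print outputs the resolved configuration and skips the build.
        cmd.append("--print")
    return cmd + list(targets)


if __name__ == "__main__":
    # Hypothetical target names, used only for illustration.
    targets = ["target-a", "target-b"]
    print(shlex.join(bake_command(targets)))                 # inspect first
    print(shlex.join(bake_command(targets, dry_run=False)))  # then build
```

Starting with a small set of targets keeps disk and memory use manageable, since each image is large and bake builds the targets concurrently.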

Using the Docker Image

Once built, the Docker image can be run by mounting the working directory and specifying the necessary arguments to execute WhisperX for processing audio files.

Red Hat UBI-Based Image

An alternative image based on Red Hat Universal Base Image (UBI) is available for users who want the security and performance characteristics of Red Hat's base images or who need to integrate with Red Hat environments. It is not the default image. In comparative scans against the default Python-based image, the UBI-based image showed fewer vulnerabilities.

Licensing

WhisperX and the files in this repository are licensed separately. WhisperX is distributed under the BSD-4 license, while the Dockerfile and related files in the repository, including the CI workflow files, are licensed under the MIT license.

In summary, the docker-whisperX project is a highly optimized and configurable setup for using WhisperX via Docker, with a range of features to accommodate different environments and users' needs.