Whisper-Finetune: Fine-tuning and Accelerating Whisper Speech Recognition Models
Introduction
Recently, OpenAI open-sourced the Whisper project, which offers near human-level English speech recognition and supports automatic speech recognition in 98 additional languages. Whisper handles tasks such as transcribing speech to text and translating that speech into English. The Whisper-Finetune project focuses on fine-tuning Whisper models with LoRA, accommodating scenarios such as training with timestamped data, training without timestamps, and even training without speech data. Several ready-to-use models are made publicly available. The project also incorporates CTranslate2 and GGML to accelerate inference; these can be applied directly to the original Whisper models without any fine-tuning. Deployment targets include desktop applications on Windows, Android apps, and servers.
Supported Models
- openai/whisper-tiny
- openai/whisper-base
- openai/whisper-small
- openai/whisper-medium
- openai/whisper-large
- openai/whisper-large-v2
- openai/whisper-large-v3
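Any of these checkpoints can be pulled directly from the Hugging Face Hub. As a minimal sketch (the model choice here is illustrative), loading a checkpoint with the Transformers library looks like this:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Illustrative choice; any checkpoint listed above works the same way.
model_id = "openai/whisper-tiny"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu")
```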
Project Overview
Main Programs
- aishell.py: Prepares the AIShell dataset for training.
- finetune.py: Fine-tunes the Whisper models.
- merge_lora.py: Merges the LoRA weights into the base Whisper model.
- evaluation.py: Evaluates the performance of either the fine-tuned or the original Whisper model.
- infer.py: Runs prediction with a fine-tuned model or an original Whisper model from Hugging Face Transformers.
- infer_ct2.py: Provides examples of prediction with CTranslate2-converted models (see the inference sketch after this list).
- infer_gui.py: Offers a GUI for prediction with fine-tuned or original Transformers Whisper models.
- infer_server.py: Deploys a fine-tuned or original Transformers Whisper model to a server.
- convert-ggml.py: Converts models to GGML format for Android or Windows applications.
- AndroidDemo: Contains the source code for deploying models on Android devices.
- WhisperDesktop: Contains the source code for desktop applications on Windows.
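For reference, a minimal CTranslate2-style inference sketch using the faster-whisper package (which wraps CTranslate2) might look like the following; the model directory and audio path are placeholders, and infer_ct2.py in this repository remains the authoritative example:

```python
from faster_whisper import WhisperModel

# Path to a CTranslate2-converted Whisper model (placeholder).
model = WhisperModel("models/whisper-ct2", device="cuda", compute_type="float16")

# Transcribe an audio file; segments are yielded lazily.
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```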
Model Evaluation and Performance
Whisper-Finetune provides detailed evaluations of both the original and fine-tuned models. Word error rate (WER) is measured across different datasets and languages to quantify the improvement from fine-tuning. The various inference-acceleration strategies are also benchmarked on GPUs, showing significant reductions in processing time for 3-minute audio files.
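As an illustration of the metric itself (not of this project's evaluation.py, which adds dataset handling on top), WER can be computed with the Hugging Face evaluate library:

```python
import evaluate

# Toy data for illustration; evaluation.py runs this over full test sets.
references = ["the cat sat on the mat", "whisper transcribes speech"]
predictions = ["the cat sat on a mat", "whisper transcribe speech"]

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.3f}")  # fraction of word-level errors
```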
Data Preparation
The training dataset is formatted as JSON lines (JSONL): each line is a JSON object describing one audio file, with metadata such as the transcription and optional sentence-level timestamps. A script (aishell.py) is provided that downloads the AIShell dataset and structures it into this format, ready for immediate use in training.
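As a hedged illustration of the JSON-lines layout (the field names here are representative, not a guaranteed schema; see aishell.py for the exact format the project emits), each line is one JSON object:

```python
import json

# One training example per line; the "sentences" timestamps are optional.
example = {
    "audio": {"path": "dataset/audio/0001.wav"},
    "sentence": "the full transcription of the utterance",
    "duration": 7.37,
    "sentences": [
        {"start": 0.0, "end": 3.2, "text": "the full transcription"},
        {"start": 3.2, "end": 7.37, "text": "of the utterance"},
    ],
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```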
Installation and Environment Setup
- PyTorch Installation: Users can install PyTorch with GPU support via Anaconda or Docker, depending on preference.
- Dependency Installation: The remaining libraries are installed with pip; additional installation steps specific to Windows are documented.
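After installing, a quick sanity check (a generic snippet, not project-specific) confirms that the CUDA build of PyTorch is in place and a GPU is visible:

```python
import torch

# Verify the CUDA build of PyTorch is installed and a GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```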
Training Models
Whisper models can be fine-tuned on a single GPU or across multiple GPUs. Single-GPU training can be launched directly, while multi-GPU setups use either torchrun or Accelerate to manage distributed processing.
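For intuition about what the LoRA fine-tuning step does, here is a minimal sketch using the peft library; the rank, scaling, and target modules below are illustrative assumptions, and finetune.py remains the authoritative implementation:

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Inject low-rank adapters into the attention projections; only these
# small adapter matrices are trained, while the base weights stay frozen.
lora_config = LoraConfig(
    r=32,                     # rank of the low-rank update (illustrative)
    lora_alpha=64,            # scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model
```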
Overall, the Whisper-Finetune project offers a comprehensive suite for enhancing and deploying OpenAI's Whisper speech recognition models, supporting applications that range from personal desktop use to robust cloud-based server deployments.