Whisper-Finetune: Enhancing Speech Recognition with OpenAI's Whisper
Whisper-Finetune is a project designed to enhance the capabilities of OpenAI's Whisper, a state-of-the-art speech recognition model. Whisper is renowned for approaching human-level accuracy on English speech recognition and supports 98 other languages. The Whisper-Finetune project fine-tunes the Whisper model with Lora, supporting training with or without timestamp data, and even training without speech data. The project also provides accelerated inference through CTranslate2 and GGML, facilitating deployment across platforms such as Windows, Android, and servers.
Recent Updates
- [2024/10/16] Released Belle-whisper-large-v3-turbo-zh, enhancing Chinese recognition with up to 64% improvement and boosting recognition speed by 7-8 times.
- [2024/06/11] Released Belle-whisper-large-v3-zh-punct, improving Chinese punctuation capabilities.
- [2024/03/11] Released Belle-whisper-large-v3-zh, achieving notable advancement in complex scenarios.
- [2023/12/29] Released Belle-whisper-large-v2-zh, with significant improvements in Chinese recognition capabilities.
Supported Models
- openai/whisper-large-v2
- openai/whisper-large-v3
- openai/whisper-large-v3-turbo
- distil-whisper
Environment Requirements
- Anaconda 3
- Python 3.10
- PyTorch 2.1.0
- GPU A100-PCIE-80GB
Project Components
- aishell.py: Prepares AIShell training data.
- finetune.py: Uses PEFT for model fine-tuning.
- finetune_all.py: Full-parameter fine-tuning.
- merge_lora.py: Merges Whisper and Lora models.
- evaluation.py: Evaluates fine-tuned models or the original Whisper model.
- infer_tfs.py: Makes predictions using transformers for short audio.
- infer_ct2.py: Predicts using the CTranslate2 model.
- infer_gui.py: GUI-based CTranslate2 model prediction.
- infer_server.py: Deploys the CTranslate2 model for server-side use.
- convert-ggml.py: Converts models to GGML format for mobile and desktop applications.
- AndroidDemo: Contains source code for Android deployment.
- WhisperDesktop: Contains Windows desktop application programs.
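As background for what merging a Lora adapter means conceptually: a Lora adapter stores two low-rank matrices A and B per target weight, and merging folds their scaled product back into the base weight, W' = W + (alpha / r) * B @ A. The sketch below illustrates this with plain Python lists; it is an assumption-laden simplification, since the actual merge_lora.py operates on PyTorch tensors inside Whisper checkpoints.

```python
# Illustrative sketch of Lora weight merging: W' = W + (alpha / r) * (B @ A).
# Pure-Python matrices (lists of lists); the real merge_lora.py works on
# PyTorch tensors, not on Python lists.

def matmul(X, Y):
    """Multiply two matrices given as lists of lists."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, A, B, alpha, r):
    """Fold a Lora update into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)  # (out_dim x r) @ (r x in_dim) -> out_dim x in_dim
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: 2x2 base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # r x in_dim = 1x2
B = [[0.5], [0.5]]      # out_dim x r = 2x1
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)           # [[2.0, 1.0], [1.0, 2.0]]
```

After merging, the combined model behaves like the fine-tuned model but no longer needs the adapter at inference time, which is what enables the CTranslate2 and GGML conversions downstream.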
Model Overview
The models, with parameters ranging from 756M to 1550M, are fine-tuned on various datasets like AISHELL-1, AISHELL-2, WenetSpeech, and HKUST. They are evaluated based on their Character Error Rate (CER) across different test sets.
Key Performance Metrics: CER(%)
- The Belle-whisper-large-v3-turbo-zh model demonstrates a substantial improvement in speed and Chinese recognition performance compared to its predecessors.
- Belle-whisper-large-v2-zh boasts the lowest CER, showcasing its superior accuracy in various test scenarios.
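For reference, CER is the character-level edit (Levenshtein) distance between the hypothesis and the reference, divided by the reference length. A minimal sketch of such a computation is shown below; the project's evaluation.py may well use a library implementation instead, so this is illustrative only.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit distance over characters, one row at a time.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, hc in enumerate(h, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(r)

print(cer("abcd", "abcf"))  # one substitution over four characters -> 0.25
```

Because CER works on characters rather than words, it is the standard metric for Chinese ASR, where word segmentation is ambiguous.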
Installation Setup
- Install the PyTorch GPU version:
  - Using Anaconda:
    conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
  - Using Docker:
    sudo docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
- Install the required libraries:
  python -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
- Windows-specific installation — install bitsandbytes for Windows:
  python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
Preparing Data
Data preparation involves creating a JSON Lines dataset in which each record contains fields such as the audio path, the sentence (transcript), and language details. The provided script aishell.py automates the downloading and preparation of the AIShell dataset for training and testing.
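A single training record could be written as in the sketch below. The field names follow the general shape described above (audio path, sentence, language) but are an assumption; consult aishell.py for the exact schema the project emits.

```python
import json

# One illustrative JSON Lines record. The exact keys ("audio", "sentence",
# "language", "duration") are assumptions for this sketch; aishell.py
# defines the real schema.
record = {
    "audio": {"path": "dataset/audio/S0002/BAC009S0002W0122.wav"},
    "sentence": "而对楼市成交抑制作用最大的限购",
    "language": "Chinese",
    "duration": 4.53,
}

# JSON Lines = one JSON object per line, appended to the manifest file.
line = json.dumps(record, ensure_ascii=False)
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(line + "\n")
```

Keeping one object per line lets the training loader stream large manifests without parsing the whole file at once.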
Model Fine-tuning
Once the data is ready, model fine-tuning can be initiated with parameters such as --base_model, which specifies the Whisper model, and --output_dir, which sets where Lora checkpoints are stored.
Single-GPU Training
CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model=openai/whisper-tiny --output_dir=output/
Multi-GPU Training
- Using torchrun:
  torchrun --nproc_per_node=2 finetune.py --base_model=openai/whisper-tiny --output_dir=output/
- Using accelerate — set up the configuration, then start training:
  accelerate config
  accelerate launch finetune.py --base_model=openai/whisper-tiny --output_dir=output/
By understanding and leveraging the features and framework of Whisper-Finetune, developers can enhance speech recognition capabilities to meet specific linguistic needs while optimizing performance across various platforms.