Whisper-Finetune: Enhancing Speech Recognition with OpenAI's Whisper
Whisper-Finetune is a project designed to enhance the capabilities of OpenAI's Whisper, a state-of-the-art speech recognition model. Whisper is renowned for approaching human-level accuracy on English speech recognition and supports 98 other languages. The Whisper-Finetune project fine-tunes the Whisper model with Lora, supporting training with or without timestamp data, and even training without speech data. The project also provides accelerated inference through CTranslate2 and GGML, facilitating deployment across platforms such as Windows, Android, and servers.
Recent Updates
- [2024/10/16] Released Belle-whisper-large-v3-turbo-zh, enhancing Chinese recognition with up to 64% improvement and boosting recognition speed by 7-8 times.
- [2024/06/11] Released Belle-whisper-large-v3-zh-punct, improving Chinese punctuation capabilities.
- [2024/03/11] Released Belle-whisper-large-v3-zh, achieving notable advancement in complex scenarios.
- [2023/12/29] Released Belle-whisper-large-v2-zh, with significant improvements in Chinese recognition capabilities.
Supported Models
- openai/whisper-large-v2
- openai/whisper-large-v3
- openai/whisper-large-v3-turbo
- distil-whisper
Environment Requirements
- Anaconda 3
- Python 3.10
- PyTorch 2.1.0
- GPU A100-PCIE-80GB
Project Components
- aishell.py: Prepares AIShell training data.
- finetune.py: Uses PEFT for model fine-tuning.
- finetune_all.py: Full-parameter fine-tuning.
- merge_lora.py: Merges Whisper and Lora models.
- evaluation.py: Evaluates fine-tuned models or the original Whisper model.
- infer_tfs.py: Makes predictions using transformers for short audio.
- infer_ct2.py: Predicts using the CTranslate2 model.
- infer_gui.py: GUI-based CTranslate2 model prediction.
- infer_server.py: Deploys the CTranslate2 model for server-side use.
- convert-ggml.py: Converts models to GGML format for mobile and desktop applications.
- AndroidDemo: Contains source code for Android deployment.
- WhisperDesktop: Contains Windows desktop application programs.
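As background for what merging a Lora adapter means conceptually: a Lora adapter stores two low-rank matrices A and B per target weight, and merging folds their scaled product back into the base weight, W' = W + (alpha / r) * B @ A. The sketch below illustrates this with plain Python lists; it is an assumption-laden simplification, since the actual merge_lora.py operates on PyTorch tensors inside Whisper checkpoints.

```python
# Illustrative sketch of Lora weight merging: W' = W + (alpha / r) * (B @ A).
# Pure-Python matrices (lists of lists); the real merge_lora.py works on
# PyTorch tensors, not on Python lists.

def matmul(X, Y):
    """Multiply two matrices given as lists of lists."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, A, B, alpha, r):
    """Fold a Lora update into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)  # (out_dim x r) @ (r x in_dim) -> out_dim x in_dim
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: 2x2 base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # r x in_dim = 1x2
B = [[0.5], [0.5]]      # out_dim x r = 2x1
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)           # [[2.0, 1.0], [1.0, 2.0]]
```

After merging, the combined model behaves like the fine-tuned model but no longer needs the adapter at inference time, which is what enables the CTranslate2 and GGML conversions downstream.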
Model Overview
The models, with parameters ranging from 756M to 1550M, are fine-tuned on various datasets like AISHELL-1, AISHELL-2, WenetSpeech, and HKUST. They are evaluated based on their Character Error Rate (CER) across different test sets.
Key Performance Metrics: CER(%)
- The Belle-whisper-large-v3-turbo-zh model demonstrates a substantial improvement in speed and Chinese recognition performance compared to its predecessors.
- Belle-whisper-large-v2-zh boasts the lowest CER, showcasing its superior accuracy in various test scenarios.
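For reference, CER is the character-level edit (Levenshtein) distance between the hypothesis and the reference, divided by the reference length. A minimal sketch of such a computation is shown below; the project's evaluation.py may well use a library implementation instead, so this is illustrative only.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit distance over characters, one row at a time.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, hc in enumerate(h, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(r)

print(cer("abcd", "abcf"))  # one substitution over four characters -> 0.25
```

Because CER works on characters rather than words, it is the standard metric for Chinese ASR, where word segmentation is ambiguous.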
Installation Setup
- Install the PyTorch GPU version:
  - Using Anaconda:
    conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
  - Using Docker:
    sudo docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
- Install the required libraries:
  python -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
- Windows-specific installation — install bitsandbytes for Windows:
  python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
Preparing Data
Data preparation involves creating a JSON Lines dataset in which each record contains fields such as the audio path, the sentence (transcript), and language details. The provided script aishell.py automates the downloading and preparation of the AIShell dataset for training and testing.
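A single training record could be written as in the sketch below. The field names follow the general shape described above (audio path, sentence, language) but are an assumption; consult aishell.py for the exact schema the project emits.

```python
import json

# One illustrative JSON Lines record. The exact keys ("audio", "sentence",
# "language", "duration") are assumptions for this sketch; aishell.py
# defines the real schema.
record = {
    "audio": {"path": "dataset/audio/S0002/BAC009S0002W0122.wav"},
    "sentence": "而对楼市成交抑制作用最大的限购",
    "language": "Chinese",
    "duration": 4.53,
}

# JSON Lines = one JSON object per line, appended to the manifest file.
line = json.dumps(record, ensure_ascii=False)
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(line + "\n")
```

Keeping one object per line lets the training loader stream large manifests without parsing the whole file at once.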
Model Fine-tuning
Once the data is ready, model fine-tuning can be initiated with parameters such as --base_model, which specifies the Whisper model, and --output_dir, which sets where Lora checkpoints are stored.
Single-GPU Training
CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model=openai/whisper-tiny --output_dir=output/
Multi-GPU Training
- Using torchrun:
  torchrun --nproc_per_node=2 finetune.py --base_model=openai/whisper-tiny --output_dir=output/
- Using accelerate — set up the configuration, then start training:
  accelerate config
  accelerate launch finetune.py --base_model=openai/whisper-tiny --output_dir=output/
By understanding and leveraging the features and framework of Whisper-Finetune, developers can enhance speech recognition capabilities to meet specific linguistic needs while optimizing performance across various platforms.