Introduction to Whisper-Node
Whisper-Node is a Node.js binding for OpenAI's Whisper that runs transcription locally, on-device. It is aimed at developers who want to integrate voice-to-text transcription into their applications while maintaining high performance.
Features
Whisper-Node stands out with several distinct features:
- Multiple Output Formats: Users can output transcriptions in multiple formats, including JSON, .txt, .srt, and .vtt, making it versatile for various applications.
- CPU Optimization: It is optimized for CPU usage, which includes support for Apple Silicon ARM, ensuring efficient processing without requiring extensive hardware resources.
- Detailed Timestamps: Timestamps can be generated with single-word precision, which is particularly useful for tasks that require detailed time tracking of spoken content.
Installation Guide
To incorporate Whisper-Node into a project, follow these simple steps:
- Add the Whisper-Node dependency to your project via npm:

  ```shell
  npm install whisper-node
  ```

- Optionally, download the Whisper model of your choice:

  ```shell
  npx whisper-node download
  ```

Note: Windows users need to install the `make` command first.
Usage
Whisper-Node offers a straightforward implementation process. Here’s a basic example of how to use it to transcribe an audio file:
```javascript
import whisper from 'whisper-node';

const transcript = await whisper("example/sample.wav");

console.log(transcript); // output: [ { start, end, speech } ]
```
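The returned array of segments is easy to post-process. As a minimal sketch, the segments can be joined into one plain-text transcript; `fullText` is an illustrative helper and the sample data below mirrors the output shape, neither is part of the whisper-node API:

```javascript
// Join the `speech` field of each segment into a single transcript string.
// `fullText` is a hypothetical helper, not part of whisper-node.
function fullText(segments) {
  return segments.map((seg) => seg.speech.trim()).join(' ');
}

// Sample data in the same shape as whisper-node's output.
const segments = [
  { start: "00:00:14.310", end: "00:00:16.480", speech: "howdy" },
  { start: "00:00:16.480", end: "00:00:18.020", speech: "partner" },
];

console.log(fullText(segments)); // "howdy partner"
```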
JSON Output Example
The transcription output can be structured in JSON format, providing timestamps and transcribed speech:
```json
[
  {
    "start": "00:00:14.310", // timestamp begin
    "end": "00:00:16.480",   // timestamp end
    "speech": "howdy"        // transcription
  }
]
```
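Segments in this shape can also be converted to other subtitle formats directly. As a sketch (the `toSrt` helper is our own, not part of whisper-node), here is a conversion to .srt text, which uses a comma rather than a period before the milliseconds:

```javascript
// Convert a whisper-node style segment array into .srt subtitle text.
// `toSrt` is a hypothetical helper, not part of the whisper-node API.
function toSrt(segments) {
  return segments
    .map((seg, i) => {
      // .srt timestamps use a comma before milliseconds: 00:00:14,310
      const start = seg.start.replace('.', ',');
      const end = seg.end.replace('.', ',');
      return `${i + 1}\n${start} --> ${end}\n${seg.speech}\n`;
    })
    .join('\n');
}

const srt = toSrt([
  { start: "00:00:14.310", end: "00:00:16.480", speech: "howdy" },
]);
console.log(srt);
// 1
// 00:00:14,310 --> 00:00:16,480
// howdy
```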
Configuration Options
Whisper-Node offers customizable options to fine-tune its operation:
```javascript
import whisper from 'whisper-node';

const options = {
  modelName: "base.en",       // default model
  whisperOptions: {
    language: 'auto',         // automatic language detection
    gen_file_txt: false,      // option to output a .txt file
    gen_file_subtitle: false, // option to output an .srt file
    gen_file_vtt: false,      // option to output a .vtt file
    word_timestamps: true     // timestamps for every word
  }
};

const transcript = await whisper("example/sample.wav", options);
```
File Format Requirements
To ensure optimal performance, input files should be in .wav format with a 16 kHz sample rate. For instance, you can convert an .mp3 file to the required format using FFmpeg:
```shell
ffmpeg -i input.mp3 -ar 16000 output.wav
```
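The same conversion can be driven from Node before calling whisper-node, for example via `child_process`. The `ffmpegArgs` helper below is our own illustrative sketch; it only builds the argument list for the command shown above and assumes FFmpeg is installed on the PATH when actually invoked:

```javascript
// Build the FFmpeg argument list for the 16 kHz .wav conversion above.
// `ffmpegArgs` is a hypothetical helper, not part of whisper-node.
function ffmpegArgs(input, output) {
  return ['-i', input, '-ar', '16000', output];
}

console.log(ffmpegArgs('input.mp3', 'output.wav'));
// [ '-i', 'input.mp3', '-ar', '16000', 'output.wav' ]

// To run the conversion (assumes ffmpeg is on the PATH):
// import { execFileSync } from 'child_process';
// execFileSync('ffmpeg', ffmpegArgs('input.mp3', 'output.wav'));
```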
Development and Acknowledgements
Whisper-Node builds on several existing projects, primarily ggerganov's C++ port of OpenAI's Whisper (whisper.cpp) and ShellJS. The project acknowledges Georgi Gerganov and Ari for their contributions.
Roadmap
Whisper-Node has a comprehensive roadmap aimed at expanding its functionalities, such as:
- Supporting projects not utilizing TypeScript.
- Enabling custom directories for model storage.
- Introducing more compatibility across browsers, React Native, and WebAssembly.
- Adding speaker diarization features and enhanced timestamp precision with WhisperX.
- Further development to include audio stream transcription.
These continued advancements make Whisper-Node a forward-thinking solution for audio transcription needs in modern applications.