Introduction to Whisper-Node
Whisper-Node is a Node.js binding for OpenAI's Whisper that runs transcription locally, on-device. It is aimed at developers who want to integrate voice-to-text transcription into their applications while maintaining high performance.
Features
Whisper-Node stands out with several distinct features:
- Multiple Output Formats: Users can output transcriptions in multiple formats, including JSON, .txt, .srt, and .vtt, making it versatile for various applications.
- CPU Optimization: It is optimized for CPU usage, which includes support for Apple Silicon ARM, ensuring efficient processing without requiring extensive hardware resources.
- Detailed Timestamps: Timestamps can be generated with single-word precision, which is particularly useful for tasks that require detailed time tracking of spoken content.
Installation Guide
To incorporate Whisper-Node into a project, follow these simple steps:
- Add the Whisper-Node dependency to your project via npm:

  ```shell
  npm install whisper-node
  ```

- Optionally, download the Whisper model of your choice:

  ```shell
  npx whisper-node download
  ```

Note: Windows users need to install the `make` command first.
Usage
Whisper-Node offers a straightforward implementation process. Here’s a basic example of how to use it to transcribe an audio file:
```javascript
import whisper from 'whisper-node';

const transcript = await whisper("example/sample.wav");

console.log(transcript); // output: [ { start, end, speech } ]
```
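The returned array of segments is easy to post-process. As a minimal sketch, the segments can be joined into one plain-text transcript; `fullText` is an illustrative helper and the sample data below mirrors the output shape, neither is part of the whisper-node API:

```javascript
// Join the `speech` field of each segment into a single transcript string.
// `fullText` is a hypothetical helper, not part of whisper-node.
function fullText(segments) {
  return segments.map((seg) => seg.speech.trim()).join(' ');
}

// Sample data in the same shape as whisper-node's output.
const segments = [
  { start: "00:00:14.310", end: "00:00:16.480", speech: "howdy" },
  { start: "00:00:16.480", end: "00:00:18.020", speech: "partner" },
];

console.log(fullText(segments)); // "howdy partner"
```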
JSON Output Example
The transcription output can be structured in JSON format, providing timestamps and transcribed speech:
```json
[
  {
    "start": "00:00:14.310", // timestamp begin
    "end": "00:00:16.480",   // timestamp end
    "speech": "howdy"        // transcription
  }
]
```
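Segments in this shape can also be converted to other subtitle formats directly. As a sketch (the `toSrt` helper is our own, not part of whisper-node), here is a conversion to .srt text, which uses a comma rather than a period before the milliseconds:

```javascript
// Convert a whisper-node style segment array into .srt subtitle text.
// `toSrt` is a hypothetical helper, not part of the whisper-node API.
function toSrt(segments) {
  return segments
    .map((seg, i) => {
      // .srt timestamps use a comma before milliseconds: 00:00:14,310
      const start = seg.start.replace('.', ',');
      const end = seg.end.replace('.', ',');
      return `${i + 1}\n${start} --> ${end}\n${seg.speech}\n`;
    })
    .join('\n');
}

const srt = toSrt([
  { start: "00:00:14.310", end: "00:00:16.480", speech: "howdy" },
]);
console.log(srt);
// 1
// 00:00:14,310 --> 00:00:16,480
// howdy
```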
Configuration Options
Whisper-Node offers customizable options to fine-tune its operation:
```javascript
import whisper from 'whisper-node';

const options = {
  modelName: "base.en",       // default model
  whisperOptions: {
    language: 'auto',         // automatic language detection
    gen_file_txt: false,      // option to output a .txt file
    gen_file_subtitle: false, // option to output an .srt file
    gen_file_vtt: false,      // option to output a .vtt file
    word_timestamps: true     // timestamps for every word
  }
};

const transcript = await whisper("example/sample.wav", options);
```
File Format Requirements
To ensure optimal performance, input files should be in .wav format with a 16 kHz sample rate. For instance, you can convert an .mp3 file to the required format using FFmpeg:
```shell
ffmpeg -i input.mp3 -ar 16000 output.wav
```
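The same conversion can be driven from Node before calling whisper-node, for example via `child_process`. The `ffmpegArgs` helper below is our own illustrative sketch; it only builds the argument list for the command shown above and assumes FFmpeg is installed on the PATH when actually invoked:

```javascript
// Build the FFmpeg argument list for the 16 kHz .wav conversion above.
// `ffmpegArgs` is a hypothetical helper, not part of whisper-node.
function ffmpegArgs(input, output) {
  return ['-i', input, '-ar', '16000', output];
}

console.log(ffmpegArgs('input.mp3', 'output.wav'));
// [ '-i', 'input.mp3', '-ar', '16000', 'output.wav' ]

// To run the conversion (assumes ffmpeg is on the PATH):
// import { execFileSync } from 'child_process';
// execFileSync('ffmpeg', ffmpegArgs('input.mp3', 'output.wav'));
```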
Development and Acknowledgements
Whisper-Node builds on several existing projects, primarily ggerganov's C++ port of OpenAI's Whisper (whisper.cpp) and ShellJS. The project acknowledges Georgi Gerganov and Ari for their contributions.
Roadmap
Whisper-Node has a comprehensive roadmap aimed at expanding its functionalities, such as:
- Supporting projects not utilizing TypeScript.
- Enabling custom directories for model storage.
- Introducing more compatibility across browsers, React Native, and WebAssembly.
- Adding speaker diarization features and enhanced timestamp precision with WhisperX.
- Further development to include audio stream transcription.
These continued advancements make Whisper-Node a forward-thinking solution for audio transcription needs in modern applications.