Introduction to Whisper.Unity
Whisper.Unity is a set of Unity3D bindings for whisper.cpp, enabling efficient inference of OpenAI's Whisper automatic speech recognition (ASR) model directly on a local machine. The project combines Whisper's recognition quality with the versatility of Unity, letting developers add high-performance speech recognition to their applications without an Internet connection.
Key Features
- Multilingual Support: Whisper.Unity can recognize speech in approximately 60 different languages. It can also translate speech from one language into text in another, such as converting spoken German to written English.
- Model Variability: It offers various model sizes, letting developers trade processing speed against transcription accuracy based on their specific needs.
- Local Execution: The ASR model runs entirely on the user's local device, so it works without Internet access, which enhances privacy and reduces latency.
- Open Source: Licensed under the MIT License, Whisper.Unity is free to use and can be integrated into commercial projects.
Supported Platforms
Whisper.Unity supports a wide range of platforms, ensuring flexibility and broad usage:
- Windows (x86_64), with optional CUDA support to further leverage Nvidia GPUs.
- macOS (both Intel and ARM architectures), with optional Metal support on newer Apple GPUs.
- Linux (x86_64), also with optional CUDA support.
- iOS, for both physical devices and the simulator.
- Android (ARM64).
- visionOS.
Currently, WebGL support is under discussion, with updates tracked through an ongoing issue on the project's GitHub page.
Samples and Performance
Whisper.Unity offers real-time performance, as demonstrated in its sample videos. One example, using the "whisper-small.bin" model, shows effective transcription of English, German, and Russian from a microphone. Another, using the "whisper-tiny.bin" model, highlights its speed: it runs about 50 times faster than real time on a MacBook with an M1 Pro chip.
Getting Started
To begin using Whisper.Unity, developers can clone the repository and open it as a standard Unity project. It includes example projects and a small multi-language model for immediate use. Alternatively, it can be added to an existing project as a Unity Package via the following Git URL:
https://github.com/Macoron/whisper.unity.git?path=/Packages/com.whisper.unity
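For a manual setup, the same Git URL can be declared directly in the project's Packages/manifest.json; a minimal sketch, assuming the package name matches the URL's path segment, looks like this:

```json
{
  "dependencies": {
    "com.whisper.unity": "https://github.com/Macoron/whisper.unity.git?path=/Packages/com.whisper.unity"
  }
}
```

Unity's Package Manager will resolve the dependency on the next project load, the same as adding the URL through the Package Manager window.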
Enhancements with CUDA and Metal
Whisper.Unity supports advanced processing capabilities using CUDA and Metal:
- CUDA Support: This requires an Nvidia GPU and the CUDA Toolkit (tested with version 12.2.0). Enabling CUDA can significantly accelerate inference on supported hardware.
- Metal Support: On compatible Apple hardware (M1 chips and later), Metal can be enabled for enhanced performance.
Expanding with Different Model Weights
Transcription accuracy can be improved, or specific languages targeted, by using different Whisper model weights. These can be downloaded from whisper.cpp's resources and integrated into the Unity project by placing them in the StreamingAssets folder.
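As a rough sketch, fetching a larger multilingual model could look like the following. The Hugging Face mirror URL, the "ggml-small.bin" filename, and the StreamingAssets subfolder are assumptions; adjust all three to match whisper.cpp's current model hosting and your project layout.

```shell
#!/bin/sh
# Hypothetical fetch script: download alternative Whisper weights into
# the project's StreamingAssets folder. The mirror URL is an assumption
# based on whisper.cpp's model hosting; verify it before use.
MODEL="ggml-small.bin"
DEST="Assets/StreamingAssets/Whisper"

# Create the target folder inside the Unity project.
mkdir -p "$DEST"

# -f fails on HTTP errors, -L follows redirects, -o sets the output path.
curl -fL -o "$DEST/$MODEL" \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/$MODEL" \
  || echo "Download failed; check the URL and your connection."
```

Once the file is in StreamingAssets, it ships with the build and can be referenced by its filename from the project's Whisper components.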
Compiling C++ Libraries
For developers interested in custom builds, Whisper.Unity includes instructions for compiling C++ libraries. The project provides prebuilt libraries but also supports rebuilding from source using GitHub Actions or manual compilation for various platforms.
Licensing
Whisper.Unity and its dependencies are open-sourced under the MIT License, ensuring developers the freedom to use, modify, and distribute it in diverse projects.
In conclusion, Whisper.Unity is a powerful tool for developers looking to integrate robust speech recognition capabilities within Unity applications, characterized by its flexibility, open-source nature, and local processing capabilities.