Introduction to SenseVoice
SenseVoice is an advanced speech foundation model that offers a range of speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). This model is a comprehensive solution for those seeking high-performance speech processing.
Highlights 🎯
SenseVoice stands out for its high-accuracy multilingual speech recognition, emotion recognition, and audio event detection:
- Multilingual Speech Recognition: With training on over 400,000 hours of data, SenseVoice supports more than 50 languages and even surpasses the Whisper model's performance.
- Rich Transcription Abilities: It excels at recognizing emotions, matching or surpassing the accuracy of leading emotion recognition models. It also detects sound events, supporting diverse human-computer interaction cues such as background music, applause, laughter, crying, coughing, and sneezing.
- Efficient Inference: SenseVoice-Small uses a non-autoregressive end-to-end framework which drastically reduces inference time. It processes 10 seconds of audio in only 70ms, making it 15 times faster than Whisper-Large.
- Easy Finetuning: SenseVoice offers user-friendly finetuning scripts and strategies, helping users tailor the system to address specific needs in their business scenarios.
- Service Deployment: Users are provided with a service deployment pipeline capable of handling multiple concurrent requests, and it supports various client-side languages such as Python, C++, HTML, Java, and C#.
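The rich transcription described above can be illustrated with a small post-processing sketch. Models of this kind are often described as emitting inline tokens for language, emotion, and audio events alongside the text; the exact token format below (e.g. `<|en|>`, `<|HAPPY|>`, `<|Speech|>`) is an assumption for illustration, not confirmed by this overview:

```python
import re

# Hypothetical raw model output with inline metadata tokens.
# The token format is assumed here for illustration only.
RAW = "<|en|><|HAPPY|><|Speech|>that is wonderful news"

TOKEN = re.compile(r"<\|([^|]+)\|>")

def split_rich_transcription(raw: str):
    """Separate inline metadata tokens from the spoken text."""
    tags = TOKEN.findall(raw)          # e.g. ['en', 'HAPPY', 'Speech']
    text = TOKEN.sub("", raw).strip()  # plain transcription
    return tags, text

tags, text = split_rich_transcription(RAW)
print(tags)  # ['en', 'HAPPY', 'Speech']
print(text)  # that is wonderful news
```

A downstream application could route on the emotion or event tags (e.g. suppress a response while `Music` is detected) while displaying only the plain text.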
What's New 🔥
Recent updates include export support for ONNX and libtorch, as well as Python runtime packages. The open-sourced SenseVoice-Small model provides high-precision multilingual speech recognition, emotion recognition, and audio event detection. CosyVoice was also introduced for natural speech generation with multi-language, timbre, and emotion control; it excels at multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following.
Benchmarks 📝
Multilingual Speech Recognition
Comparisons between SenseVoice and Whisper models on datasets such as AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice show that SenseVoice-Small performs exceptionally well in Chinese and Cantonese recognition.
Speech Emotion Recognition
Despite the lack of standardized benchmarks, SenseVoice has been evaluated on a variety of test sets without fine-tuning, where it delivers strong results, often matching or exceeding recent state-of-the-art models and outperforming many open-source alternatives.
Audio Event Detection
SenseVoice also functions competently as an event detection model; it has been evaluated on the ESC-50 dataset against models such as BEATs and PANNs. Although its event-classification performance is limited by training-data constraints, it performs commendably given its primary focus on speech data.
Computational Efficiency
SenseVoice-Small's non-autoregressive architecture results in low latency, with inference times more than five times faster than Whisper-Small and 15 times faster than Whisper-Large.
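These latency claims can be expressed as a real-time factor (RTF), the ratio of processing time to audio duration. A minimal sketch, using the 70 ms figure stated above (the Whisper latencies below are implied by the stated speedups, not measured):

```python
def real_time_factor(audio_seconds: float, inference_seconds: float) -> float:
    """RTF = processing time / audio duration; lower is faster."""
    return inference_seconds / audio_seconds

# From the text: SenseVoice-Small processes 10 s of audio in ~70 ms.
rtf_small = real_time_factor(10.0, 0.070)
print(f"SenseVoice-Small RTF: {rtf_small:.3f}")  # 0.007

# Implied latencies from the stated speedups (illustrative only):
whisper_small_latency = 0.070 * 5    # >5x slower  -> ~0.35 s per 10 s of audio
whisper_large_latency = 0.070 * 15   # 15x slower  -> ~1.05 s per 10 s of audio
print(f"Whisper-Large (implied): {whisper_large_latency:.2f} s per 10 s of audio")
```

An RTF of 0.007 means the model needs well under 1% of an utterance's duration to transcribe it, which is what makes high-concurrency serving practical.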
Requirements and Usage
To get started with SenseVoice, users can set up the environment with a simple installation command. The model supports audio input in various formats and durations, making it a flexible solution for different applications. Comprehensive guidance and context for parameter settings are provided to users, helping them optimize the model according to their specific needs.
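Since the model accepts audio of various formats and durations, a common preparation step is to verify a file's sample rate and length before inference. A minimal sketch using only the Python standard library (the 16 kHz mono example is an illustrative convention, not a stated requirement of SenseVoice):

```python
import io
import wave

def audio_info(wav_bytes: bytes):
    """Return (sample_rate_hz, duration_seconds) for an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return rate, duration

# Build one second of 16 kHz, 16-bit mono silence as a stand-in recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)  # 16000 zero samples = 1 s

rate, duration = audio_info(buf.getvalue())
print(rate, duration)  # 16000 1.0
```

A check like this makes it easy to batch inputs by duration or resample outliers before they reach the model.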
In sum, SenseVoice represents a significant advance in speech technology, offering cutting-edge multilingual recognition, emotion recognition, and audio event detection. Its efficient inference and flexible deployment options make it an attractive choice for businesses and developers seeking high-performance speech recognition and analysis tools.