Deep Speaker: An End-to-End Neural Speaker Embedding System
Overview
Deep Speaker is an unofficial TensorFlow/Keras implementation of a neural speaker embedding system. It maps spoken utterances to an embedding space in which speaker similarity is measured by cosine similarity. These embeddings are useful for tasks such as speaker identification, verification, and clustering. The project is tested and works with TensorFlow 2.3 to 2.6.
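For intuition about the scoring step, here is a minimal NumPy sketch of cosine similarity (not the project's own batch_cosine_similarity): the dot product of two embeddings divided by the product of their norms.
import numpy as np
def cosine_similarity(a, b):
    # Ranges from -1 to 1; values near 1 indicate very similar embeddings,
    # i.e. utterances likely spoken by the same speaker.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707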
Sample Results
The models in Deep Speaker are trained on clean speech data, so their performance may degrade on noisy recordings. For best results, preprocess the audio to remove silence and background noise. Evaluations on the LibriSpeech dataset illustrate how well the models perform: a ResCNN model trained with softmax reaches 99.6% accuracy, and the same model refined with triplet training reaches 99.7%.
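Silence removal is not part of the pipeline shown in this README. One possible preprocessing step is to trim leading and trailing silence before extracting features; the sketch below uses librosa and soundfile, with placeholder file paths and a top_db threshold you would tune for your own recordings.
import librosa
import soundfile as sf
# Load at 16 kHz, matching the sample rate of LibriSpeech audio.
wav, sr = librosa.load('utterance.wav', sr=16000)
# Drop leading/trailing audio quieter than top_db decibels below the peak.
trimmed, _ = librosa.effects.trim(wav, top_db=30)
sf.write('utterance_trimmed.wav', trimmed, sr)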
Getting Started
System Requirements
Before proceeding with Deep Speaker, users should ensure they have:
- At least 300GB of SSD storage
- 32GB of RAM and at least 32GB of swap space
- An NVIDIA GPU, such as a 1080Ti
Installation
The project requires Python 3.6+, TensorFlow 2.0+, and Keras 2.3.1+. Install the dependencies with:
pip install -r requirements.txt
If an error related to libsndfile occurs, running sudo apt-get install libsndfile-dev should resolve it.
Training Procedure
The training scripts provided in the repository let users build the models locally; on a setup with a GTX1070 GPU, expect roughly a week of training time. The following sequence of commands downloads the dataset, builds the MFCC features and model inputs, and runs softmax pre-training followed by triplet fine-tuning:
pip uninstall -y tensorflow && pip install tensorflow-gpu
./deep-speaker download_librispeech
./deep-speaker build_mfcc
./deep-speaker build_model_inputs
./deep-speaker train_softmax
./deep-speaker train_triplet
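The build_mfcc step caches audio features to disk before training. The project uses its own feature extraction (read_mfcc); purely as a rough illustration of what MFCC extraction involves, here is a sketch using librosa, with assumed parameters rather than the project's own.
import librosa
# Load at 16 kHz, the rate of LibriSpeech audio.
wav, sr = librosa.load('utterance.wav', sr=16000)
# 25 ms windows (n_fft=400) with a 10 ms hop (hop_length=160) are common
# defaults for speech; these exact values are illustrative only.
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=64, n_fft=400, hop_length=160)
print(mfcc.shape)  # (n_mfcc, num_frames)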
Testing with a Pretrained Model
To evaluate with a pretrained model, download one of the provided checkpoints. For example, with a ResCNN model trained on the entire LibriSpeech dataset, the following Python code compares two utterances from the same speaker against one from a different speaker:
import random
import numpy as np
from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel
from deep_speaker.test import batch_cosine_similarity
# Seed for reproducibility: sample_from_mfcc selects frames at random.
np.random.seed(123)
random.seed(123)
# Define the model and load the pretrained ResCNN checkpoint.
model = DeepSpeakerModel()
model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)
# Two utterances from the same speaker.
mfcc_001 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_001.wav', SAMPLE_RATE), NUM_FRAMES)
predict_001 = model.m.predict(np.expand_dims(mfcc_001, axis=0))
mfcc_002 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_002.wav', SAMPLE_RATE), NUM_FRAMES)
predict_002 = model.m.predict(np.expand_dims(mfcc_002, axis=0))
# One utterance from a different speaker (a LibriSpeech sample).
mfcc_003 = sample_from_mfcc(read_mfcc('samples/1255-90413-0001.flac', SAMPLE_RATE), NUM_FRAMES)
predict_003 = model.m.predict(np.expand_dims(mfcc_003, axis=0))
# Cosine similarity should be higher for the same speaker than for different speakers.
print('SAME SPEAKER', batch_cosine_similarity(predict_001, predict_002))
print('DIFF SPEAKER', batch_cosine_similarity(predict_001, predict_003))
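In practice, verification reduces to thresholding the similarity score. Continuing from the snippet above, a minimal sketch follows; the 0.5 threshold is an assumed illustrative value, not one shipped with the project, and would be calibrated on held-out data.
THRESHOLD = 0.5  # assumed value for illustration; tune on a validation set
score = batch_cosine_similarity(predict_001, predict_002)[0]
print('same speaker' if score > THRESHOLD else 'different speakers')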
Future Directions
Deep Speaker continues to evolve, with additions such as LSTM-based models and fusion scoring reflecting ongoing contributions from its user community.
Contributors
Numerous contributors have helped improve Deep Speaker; their collaborative efforts are visible in the repository's contributions graph.
With its robust framework and promising results, Deep Speaker stands as a significant step forward in speaker recognition technology.