Introduction to DC-TTS Project
DC-TTS (Deep Convolutional Text-to-Speech) is an advanced model aimed at converting text into speech using deep convolutional networks. The project takes inspiration from the paper titled "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention." Rather than merely replicating the research, the project explores how its insights can be implemented and applied to various sound projects.
Requirements
To effectively utilize this project, certain software requirements need to be met:
- NumPy: Version 1.11.1 or newer.
- TensorFlow: Version 1.3 or newer (note a change in `tf.contrib.layers.layer_norm` from version 1.3).
- librosa: A Python package for music and audio analysis.
- tqdm: A fast, extensible progress bar for Python.
- matplotlib: A plotting library to create visualizations.
- scipy: A Python library for scientific computing.
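Assuming the packages have been installed in the usual way (for example with pip or conda), a small check like the one below can confirm that the installed versions meet these requirements. It is only an illustration, not part of the repository.

```python
# Quick sanity check of the installed dependency versions (illustrative only;
# not part of the repository). Install the packages beforehand with pip/conda.
import numpy, tensorflow, librosa, tqdm, matplotlib, scipy

print("numpy      ", numpy.__version__)       # should be >= 1.11.1
print("tensorflow ", tensorflow.__version__)  # should be >= 1.3
print("librosa    ", librosa.__version__)
print("tqdm       ", tqdm.__version__)
print("matplotlib ", matplotlib.__version__)
print("scipy      ", scipy.__version__)
```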
Data Utilization
The project involves training English and Korean models using four distinct datasets:
- LJ Speech Dataset: A robust benchmark dataset, publicly available with 24 hours of quality audio samples.
- Nick Offerman's Audiobooks: A dataset of 18 hours to observe model performance on a diverse set of speech samples.
- Kate Winslet's Audiobook: A dataset containing 5 hours, used to test model adaptability to smaller data samples.
- KSS Dataset: A Korean dataset with over 12 hours of audio from a single speaker.
Training Process
The training process in DC-TTS involves several steps:
- Step 0: Download the LJ Speech Dataset or arrange for your own dataset.
- Step 1: Adjust hyperparameters in the `hyperparams.py` file. Enable preprocessing if needed by setting `prepro` to True.
- Step 2: Initiate training of the Text2Mel model by executing `python train.py 1`. If preprocessing is enabled, run `python prepro.py` first.
- Step 3: Train the SSRN by running `python train.py 2`.
Steps 2 and 3 are independent of each other and can be performed simultaneously if multiple GPU cards are available. A sketch of the kind of settings Step 1 adjusts is shown below.
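The exact contents of `hyperparams.py` depend on the repository version; the excerpt below only illustrates typical fields, and the names and values shown (`prepro`, `data`, `B`, `lr`, `logdir`) are assumptions to be checked against the actual file.

```python
# Illustrative excerpt of what hyperparams.py typically holds; field names and
# values are assumptions, so check them against the actual file.
class Hyperparams:
    prepro = True                   # if True, run `python prepro.py` before training
    data = "/data/LJSpeech-1.1"     # path to the downloaded corpus (assumed location)
    B = 32                          # batch size
    lr = 0.001                      # initial learning rate (decayed during training)
    logdir = "logdir/LJ01"          # where checkpoints and training curves are saved
```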
Training Outputs
Training Curves
The training process is documented with visualization of training curves, offering insights into model performance over time.
Attention Plot
The attention plot demonstrates the model's capability to align text input with audio output during the training process.
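The repository writes these plots during training; the snippet below is only a minimal sketch of how such an alignment matrix could be visualized with matplotlib. The helper name and the assumed matrix shape are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(A, path="alignment.png"):
    """Visualize an attention matrix A of shape (text_positions, mel_frames).

    Illustrative helper only; the repository produces its own alignment plots
    during training.
    """
    fig, ax = plt.subplots()
    im = ax.imshow(A, aspect="auto", origin="lower", interpolation="none")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("Decoder timestep (mel frames)")
    ax.set_ylabel("Encoder timestep (text)")
    fig.savefig(path)
    plt.close(fig)

# Example with random data, just to exercise the function:
# plot_alignment(np.random.rand(60, 200))
```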
Sample Synthesis
Utilizing the Harvard Sentences, the model generates speech samples, which are included within the repository. Users can run `synthesize.py` to produce these audio outputs and find the results in the `samples` directory.
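Internally, synthesis runs text through Text2Mel and SSRN and then converts the predicted magnitude spectrogram back into a waveform. The repository's `synthesize.py` contains its own reconstruction code; the sketch below only illustrates that final step using the Griffin-Lim routine available in recent librosa releases, with assumed parameter values.

```python
import numpy as np
import librosa
from scipy.io import wavfile

def mag_to_wav(mag, hop_length=256, power=1.3, n_iter=50):
    """Convert a magnitude spectrogram (shape: 1 + n_fft//2, frames) to audio.

    Illustrative only: uses librosa's Griffin-Lim rather than the repository's
    own spectrogram-to-wave code; the power sharpening and iteration count are
    assumed values.
    """
    mag = mag ** power                         # mildly sharpen the spectrogram
    wav = librosa.griffinlim(mag, n_iter=n_iter, hop_length=hop_length)
    wav = wav / max(1e-8, np.abs(wav).max())   # peak-normalize
    return wav.astype(np.float32)

# wavfile.write("sample.wav", 22050, mag_to_wav(predicted_mag))
```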
Listening to Generated Samples
DC-TTS offers various generated samples across different datasets:
- LJ Dataset: Access samples here, at iteration steps including 50k, 200k, 310k, and up to 800k.
- Nick's Dataset: Listen to samples from 40k through 800k iterations.
- Kate's Dataset: Samples range from 40k to 800k iterations.
- KSS Dataset: Available samples include 400k iterations.
Pretrained Model Availability
Users can download a pretrained model for the LJ dataset from this link.
Important Notes
It's noteworthy that during the implementation, several modifications and observations were made:
- Layer normalization was added for effectiveness.
- The learning rate was decayed rather than fixed.
- Training Text2Mel and SSRN simultaneously was less effective; thus, they were trained separately.
- The training duration exceeded the initial claim of under one day but was still faster than other models due to the use of convolutional layers.
- Due to guided attention, the alignment remained consistent and stable (a sketch of the guided-attention penalty follows this list).
- Dropouts were applied as a regularization measure.
- For those interested, other text-to-speech models like Tacotron and Deep Voice 3 could also be considered for comparison.
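To make the guided-attention note concrete, here is a minimal NumPy sketch of the penalty described in the paper: attention weights far from the diagonal are discouraged, which keeps the text-to-audio alignment roughly monotonic. Shapes, names, and the width parameter `g` are illustrative, not the repository's exact code.

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Penalty matrix W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2))."""
    n = np.arange(N).reshape(-1, 1) / N
    t = np.arange(T).reshape(1, -1) / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

def guided_attention_loss(A, g=0.2):
    """A: attention matrix of shape (text_positions, mel_frames), rows summing to ~1."""
    N, T = A.shape
    return float(np.mean(A * guided_attention_weights(N, T, g)))

# Example with a random, row-normalized attention matrix:
# A = np.random.rand(60, 200); A /= A.sum(axis=1, keepdims=True)
# print(guided_attention_loss(A))
```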
DC-TTS provides a rich framework for generating speech from text with a focus on flexibility and adaptability across different languages and datasets.