Introduction to DC-TTS Project
DC-TTS (Deep Convolutional Text-to-Speech) is an advanced model aimed at converting text into speech using deep convolutional networks. The project takes inspiration from the paper titled "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention." Rather than merely replicating the research, the project explores how its insights can be implemented and applied to various sound projects.
Requirements
To effectively utilize this project, certain software requirements need to be met:
- NumPy: Version 1.11.1 or newer.
- TensorFlow: Version 1.3 or newer (note a change in `tf.contrib.layers.layer_norm` from version 1.3).
- librosa: A Python package for music and audio analysis.
- tqdm: A fast, extensible progress bar for Python.
- matplotlib: A plotting library to create visualizations.
- scipy: A Python library for scientific computing.
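Assuming the packages have been installed in the usual way (for example with pip or conda), a small check like the one below can confirm that the installed versions meet these requirements. It is only an illustration, not part of the repository.

```python
# Quick sanity check of the installed dependency versions (illustrative only;
# not part of the repository). Install the packages beforehand with pip/conda.
import numpy, tensorflow, librosa, tqdm, matplotlib, scipy

print("numpy      ", numpy.__version__)       # should be >= 1.11.1
print("tensorflow ", tensorflow.__version__)  # should be >= 1.3
print("librosa    ", librosa.__version__)
print("tqdm       ", tqdm.__version__)
print("matplotlib ", matplotlib.__version__)
print("scipy      ", scipy.__version__)
```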
Data Utilization
The project involves training English and Korean models using four distinct datasets:
- LJ Speech Dataset: A robust benchmark dataset, publicly available with 24 hours of quality audio samples.
- Nick Offerman's Audiobooks: A dataset of 18 hours to observe model performance on a diverse set of speech samples.
- Kate Winslet's Audiobook: A dataset containing 5 hours, used to test model adaptability to smaller data samples.
- KSS Dataset: A Korean dataset with over 12 hours of audio from a single speaker.
Training Process
The training process in DC-TTS involves several steps:
- Step 0: Download the LJ Speech Dataset or arrange for your own dataset.
- Step 1: Adjust hyperparameters in the `hyperparams.py` file. Enable preprocessing if needed by setting `prepro` to True.
- Step 2: Initiate training of the Text2Mel model by executing `python train.py 1`. If preprocessing is enabled, run `python prepro.py` first.
- Step 3: Train the SSRN by running `python train.py 2`.
Steps 2 and 3 are independent of each other and can be performed simultaneously if multiple GPU cards are available. A sketch of the kind of settings Step 1 adjusts is shown below.
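The exact contents of `hyperparams.py` depend on the repository version; the excerpt below only illustrates typical fields, and the names and values shown (`prepro`, `data`, `B`, `lr`, `logdir`) are assumptions to be checked against the actual file.

```python
# Illustrative excerpt of what hyperparams.py typically holds; field names and
# values are assumptions, so check them against the actual file.
class Hyperparams:
    prepro = True                   # if True, run `python prepro.py` before training
    data = "/data/LJSpeech-1.1"     # path to the downloaded corpus (assumed location)
    B = 32                          # batch size
    lr = 0.001                      # initial learning rate (decayed during training)
    logdir = "logdir/LJ01"          # where checkpoints and training curves are saved
```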
Training Outputs
Training Curves
The training process is documented with visualization of training curves, offering insights into model performance over time.
Attention Plot
The attention plot demonstrates the model's capability to align text input with audio output during the training process.
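The repository writes these plots during training; the snippet below is only a minimal sketch of how such an alignment matrix could be visualized with matplotlib. The helper name and the assumed matrix shape are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(A, path="alignment.png"):
    """Visualize an attention matrix A of shape (text_positions, mel_frames).

    Illustrative helper only; the repository produces its own alignment plots
    during training.
    """
    fig, ax = plt.subplots()
    im = ax.imshow(A, aspect="auto", origin="lower", interpolation="none")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("Decoder timestep (mel frames)")
    ax.set_ylabel("Encoder timestep (text)")
    fig.savefig(path)
    plt.close(fig)

# Example with random data, just to exercise the function:
# plot_alignment(np.random.rand(60, 200))
```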
Sample Synthesis
Utilizing the Harvard Sentences, the model generates speech samples, which are included within the repository. Users can run `synthesize.py` to produce these audio outputs and find the results in the `samples` directory.
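Internally, synthesis runs text through Text2Mel and SSRN and then converts the predicted magnitude spectrogram back into a waveform. The repository's `synthesize.py` contains its own reconstruction code; the sketch below only illustrates that final step using the Griffin-Lim routine available in recent librosa releases, with assumed parameter values.

```python
import numpy as np
import librosa
from scipy.io import wavfile

def mag_to_wav(mag, hop_length=256, power=1.3, n_iter=50):
    """Convert a magnitude spectrogram (shape: 1 + n_fft//2, frames) to audio.

    Illustrative only: uses librosa's Griffin-Lim rather than the repository's
    own spectrogram-to-wave code; the power sharpening and iteration count are
    assumed values.
    """
    mag = mag ** power                         # mildly sharpen the spectrogram
    wav = librosa.griffinlim(mag, n_iter=n_iter, hop_length=hop_length)
    wav = wav / max(1e-8, np.abs(wav).max())   # peak-normalize
    return wav.astype(np.float32)

# wavfile.write("sample.wav", 22050, mag_to_wav(predicted_mag))
```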
Listening to Generated Samples
DC-TTS offers various generated samples across different datasets:
- LJ Dataset: Access samples here, at iteration steps including 50k, 200k, 310k, and up to 800k.
- Nick's Dataset: Listen to samples from 40k through 800k iterations.
- Kate's Dataset: Samples range from 40k to 800k iterations.
- KSS Dataset: Available samples include 400k iterations.
Pretrained Model Availability
Users can download a pretrained model for the LJ dataset from this link.
Important Notes
It's noteworthy that during the implementation, several modifications and observations were made:
- Layer normalization was added for effectiveness.
- The learning rate was decayed rather than fixed.
- Training Text2Mel and SSRN simultaneously was less effective; thus, they were trained separately.
- The training duration exceeded the initial claim of under one day but was still faster than other models due to the use of convolutional layers.
- Due to guided attention, the alignment remained consistent and stable (a sketch of the guided-attention penalty follows this list).
- Dropouts were applied as a regularization measure.
- For those interested, other text-to-speech models like Tacotron and Deep Voice 3 could also be considered for comparison.
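To make the guided-attention note concrete, here is a minimal NumPy sketch of the penalty described in the paper: attention weights far from the diagonal are discouraged, which keeps the text-to-audio alignment roughly monotonic. Shapes, names, and the width parameter `g` are illustrative, not the repository's exact code.

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Penalty matrix W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2))."""
    n = np.arange(N).reshape(-1, 1) / N
    t = np.arange(T).reshape(1, -1) / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

def guided_attention_loss(A, g=0.2):
    """A: attention matrix of shape (text_positions, mel_frames), rows summing to ~1."""
    N, T = A.shape
    return float(np.mean(A * guided_attention_weights(N, T, g)))

# Example with a random, row-normalized attention matrix:
# A = np.random.rand(60, 200); A /= A.sum(axis=1, keepdims=True)
# print(guided_attention_loss(A))
```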
DC-TTS provides a rich framework for generating speech from text with a focus on flexibility and adaptability across different languages and datasets.