GST-Tacotron - Unsupervised Style Modeling in Speech Synthesis with Blizzard Dataset Support for Chinese Speech

Introduction to GST-Tacotron Project

The GST-Tacotron project provides a PyTorch implementation of GST-Tacotron, that is, Tacotron extended with Global Style Tokens (GST). The model is described in the paper "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis." Its goal is to model, control, and transfer speaking style in end-to-end speech synthesis.

Project Overview

The project extends Tacotron with style tokens: a small bank of embeddings, learned jointly with the synthesizer and without style labels, that captures variation in speaking style. At synthesis time the tokens can be used to control the style of the output or to transfer the style of a reference utterance.
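To make the idea of style tokens concrete, here is a minimal, framework-free sketch of how a style embedding can be formed from a bank of tokens. It is an illustration, not the code in this repository. It assumes a single attention head, plain scaled dot-product scoring, and a reference embedding of the same size as the tokens. The paper uses multi-head attention with learned projections, and the names softmax, combine_tokens, and attend are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_tokens(weights, tokens):
    """Weighted sum of the token vectors; the result is the style embedding."""
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def attend(reference, tokens):
    """Score each token against the reference embedding, then combine.

    Assumption: scaled dot-product scoring with a single head.
    """
    scale = math.sqrt(len(reference))
    scores = [sum(r * t for r, t in zip(reference, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return weights, combine_tokens(weights, tokens)


if __name__ == "__main__":
    # Three toy tokens in a 2-dimensional space (in a real model these are learned).
    tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

    # Style transfer: weights are derived from a reference embedding.
    weights, style = attend([2.0, 0.0], tokens)
    print(weights, style)

    # Style control: weights are chosen by hand, here selecting only the second token.
    print(combine_tokens([0.0, 1.0, 0.0], tokens))
```

The two calls in the usage section mirror the two ways the tokens are used: attend derives the weights from a reference, while combine_tokens accepts weights picked directly.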

Recent Updates

The project has been updated to support the Blizzard dataset.

Installation Requirements

To get started, install the Python packages listed in requirements.txt using pip:

pip3 install -r requirements.txt

File Structure

GST-Tacotron's implementation is organized into various modules and scripts, each serving a distinct purpose:

Hyperparameters.py: Contains the hyperparameters necessary for training and synthesizing speech.
Network.py: Defines the encoder and decoder architectures.
Modules.py: Includes additional modules specifically for Tacotron.
Loss.py: Specifies the loss function utilized during training.
Data.py: Facilitates the loading and processing of datasets.
utils.py: Offers utility functions for data input and output operations.
Synthesis.py: Handles the actual speech generation process.

Training the Model

To train the GST-Tacotron model, users need to follow these steps:

Dataset Preparation: Download a multi-speaker dataset and preprocess it, then implement the get_XX_data function in Data.py to load it.
Hyperparameters Setting: Adjust the necessary hyperparameters in Hyperparameters.py according to your specific training needs.
Directory Setup: Create a directory named log to store logs and training outputs, with a structure as shown below:

--- log
|    |
|    --- log[log_number]
|
--- code
     |
     --- Tacotron
         |
         --- train.py
         |
         --- Network.py
         |
       ......

Initiate Training: Run train.py, passing the log number, dataset size, and starting epoch as arguments. For example:

python3 train.py 0 all 0
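The directory setup and training command above can be sketched in Python as follows. This is a convenience sketch under stated assumptions, not part of the repository: it assumes the log directory sits next to code, that log[log_number] means a folder such as log0, and that train.py takes exactly the three arguments shown in the example. The helper names make_log_dir and train_command are hypothetical.

```python
from pathlib import Path


def make_log_dir(project_root, log_number):
    """Create <project_root>/log/log<log_number> and return its path.

    Assumption: "log[log_number]" in the tree means e.g. "log0".
    """
    log_dir = Path(project_root) / "log" / f"log{log_number}"
    log_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return log_dir


def train_command(log_number, dataset_size, start_epoch):
    """Build the argument list for the training command shown above."""
    return ["python3", "train.py", str(log_number), str(dataset_size), str(start_epoch)]


if __name__ == "__main__":
    import tempfile

    # Use a temporary directory here so the demo does not touch a real project.
    root = tempfile.mkdtemp()
    print(make_log_dir(root, 0))
    print(" ".join(train_command(0, "all", 0)))
```

The returned list can be passed to subprocess.run with cwd set to the directory that contains train.py.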

Generating Audio

To generate speech from the trained model, run generate.py. Edit the script first to set the text to synthesize. The text must be Chinese, because the pre-trained model currently supports only Chinese speech synthesis.

Community Engagement

The project's star history chart shows how interest in the repository has grown over time.

In summary, the GST-Tacotron project offers a PyTorch framework for experimenting with style modeling, control, and transfer in end-to-end speech synthesis.