Introduction to Expressive-FastSpeech2
Expressive-FastSpeech2 is a PyTorch implementation that delivers expressive, natural-sounding text-to-speech (TTS) with non-autoregressive methods. The project centers on Emotional TTS and Conversational TTS systems, built on notable datasets such as the AIHub Multimodal Video AI datasets for Korean and the IEMOCAP database for English.
Contributions of the Project
- Non-autoregressive Expressive TTS: The core goal of this project is to lay the groundwork for future research and practical applications in non-autoregressive expressive TTS. It covers Emotional TTS, which imbues synthesized speech with emotional expressiveness, and Conversational TTS, which produces speech suited to dialogue.
- Annotated Data Processing: The project offers insights into handling new datasets, possibly in other languages, to train non-autoregressive emotional TTS models. This entails methods for processing and preparing annotated data that improve the expressiveness of the generated speech (a minimal parsing sketch follows this list).
- English and Korean TTS: The project covers Korean alongside English TTS, addressing language-specific features. Korean, for instance, requires additional data processing to accommodate its linguistic characteristics, such as training the Montreal Forced Aligner on language-specific datasets.
- Adopting Other Languages: Researchers who wish to adapt this model to other languages can refer to the section "Training with your own dataset (own language)" in the categorical branch of the project repository.
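As a concrete illustration of what annotated-data handling can look like, the sketch below parses a hypothetical pipe-delimited filelist carrying speaker and emotion labels. The file layout, field order, and label set here are assumptions made for this example, not the repository's actual format.

```python
from pathlib import Path

# Hypothetical metadata format (assumed for illustration, not the repo's
# actual layout): one utterance per line, pipe-delimited as
#   audio_path|speaker_id|emotion|text
EMOTIONS = {"neutral", "happy", "sad", "angry"}  # assumed label set

def load_emotion_metadata(filelist: str):
    """Parse an emotion-annotated filelist into training entries."""
    entries = []
    for line in Path(filelist).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        audio_path, speaker, emotion, text = line.split("|", maxsplit=3)
        if emotion not in EMOTIONS:
            raise ValueError(f"Unknown emotion label: {emotion!r}")
        entries.append({"audio": audio_path, "speaker": speaker,
                        "emotion": emotion, "text": text})
    return entries

if __name__ == "__main__":
    # "train_filelist.txt" is a placeholder name for this sketch.
    for entry in load_emotion_metadata("train_filelist.txt")[:3]:
        print(entry)
```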
Project Repository Structure
FastSpeech2 is the foundational framework for this project, providing a robust non-autoregressive multi-speaker TTS platform. Familiarity with the original FastSpeech2 paper and implementation helps in fully grasping the enhancements made in Expressive-FastSpeech2.
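To give a feel for the non-autoregressive backbone, here is a minimal sketch of FastSpeech2-style length regulation, the mechanism that expands phoneme-level hidden states by their durations so all mel frames can be generated in parallel. The duplication logic follows the FastSpeech2 paper, but the module and variable names are our own, not the repository's code.

```python
import torch
import torch.nn as nn

class LengthRegulator(nn.Module):
    """FastSpeech2-style length regulation: repeat each phoneme's hidden
    state by its (predicted or ground-truth) duration in frames, so the
    decoder can generate all mel frames in parallel."""

    def forward(self, x: torch.Tensor, durations: torch.LongTensor):
        # x: (batch, phonemes, hidden), durations: (batch, phonemes)
        expanded = [
            torch.repeat_interleave(seq, dur, dim=0)
            for seq, dur in zip(x, durations)
        ]
        # Pad each expanded sequence to the longest one in the batch.
        return nn.utils.rnn.pad_sequence(expanded, batch_first=True)

# Toy usage: 2 phonemes expanded to 3 + 2 = 5 frames.
phoneme_hidden = torch.randn(1, 2, 8)
durations = torch.tensor([[3, 2]])
frames = LengthRegulator()(phoneme_hidden, durations)
print(frames.shape)  # torch.Size([1, 5, 8])
```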
Emotional TTS
The project branch dedicated to Emotional TTS includes basic implementations based on the Emotional End-to-End Neural Speech synthesizer approach. It is split into two branches:
- Categorical Branch: This branch conditions on categorical emotion labels such as happy or sad.
- Continuous Branch: This branch conditions on continuous emotional descriptors such as arousal and valence, in addition to the categorical labels.
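The sketch below shows one plausible way such conditioning can be wired, in the spirit of the two branches: an embedding lookup for categorical labels and a small linear projection for arousal/valence, both added to the encoder output. All module names and dimensions here are assumptions for illustration, not the repository's actual code.

```python
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    """Illustrative emotion conditioning (assumed design): categorical
    labels go through an embedding table, continuous arousal/valence
    descriptors through a linear projection, and both are broadcast-added
    to the phoneme encoder output."""

    def __init__(self, hidden: int = 256, n_emotions: int = 7):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_emotions, hidden)  # categorical branch
        self.av_proj = nn.Linear(2, hidden)  # continuous branch: (arousal, valence)

    def forward(self, enc_out, emotion_id, arousal_valence=None):
        # enc_out: (batch, time, hidden); emotion_id: (batch,)
        cond = self.emotion_emb(emotion_id)          # (batch, hidden)
        if arousal_valence is not None:              # (batch, 2)
            cond = cond + self.av_proj(arousal_valence)
        return enc_out + cond.unsqueeze(1)           # broadcast over time

enc_out = torch.randn(2, 50, 256)
out = EmotionConditioner()(enc_out, torch.tensor([0, 3]), torch.rand(2, 2))
print(out.shape)  # torch.Size([2, 50, 256])
```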
Conversational TTS
The conversational branch focuses on synthesizing speech for dialogue systems and voice agents by conditioning on chat history. It draws inspiration from the Conversational End-to-End TTS approach for voice agents.
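One common way to condition on chat history, sketched below under assumed names and shapes, is to summarize embeddings of the preceding turns with a recurrent context encoder and add the resulting context vector to the current utterance's encoding. The actual conversational branch may differ in detail.

```python
import torch
import torch.nn as nn

class ChatHistoryEncoder(nn.Module):
    """Illustrative conversation-context encoder (assumed design, not the
    repo's exact code): a GRU summarizes embeddings of previous turns
    into one context vector that conditions the current utterance."""

    def __init__(self, turn_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(turn_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, history, enc_out):
        # history: (batch, n_turns, turn_dim) -- embeddings of past turns
        # enc_out: (batch, time, hidden)      -- current utterance encoding
        _, last_state = self.gru(history)        # (1, batch, hidden)
        context = self.proj(last_state.squeeze(0))
        return enc_out + context.unsqueeze(1)    # broadcast over time

history = torch.randn(2, 4, 256)   # 4 previous turns per sample
enc_out = torch.randn(2, 60, 256)
print(ChatHistoryEncoder()(history, enc_out).shape)  # torch.Size([2, 60, 256])
```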
Citing the Project
Citation instructions are outlined in the repository for anyone who uses or references this project, acknowledging the contributions of Keon Lee, the developer behind this work.
Acknowledgments
The development of Expressive-FastSpeech2 builds upon several seminal works and contributions in the field of TTS, such as FastSpeech2 by ming024, Korean-FastSpeech2-Pytorch by HGU-DLLAB, and other notable projects that have significantly influenced the advancement of multi-speaker and expressive speech synthesis.