Project Introduction: Deep Learning for Emotional Text-to-Speech
The "Deep Learning for Emotional Text-to-Speech" (dl-for-emo-tts) project is an innovative approach that explores the utilization of deep learning techniques to enhance text-to-speech (TTS) systems with the capability to express emotions. By leveraging neural network models, the project aims to generate more expressive and human-like speech outputs that convey a range of emotions. This introduction delves into the various components and strategies employed in this project.
Datasets
The project incorporates multiple datasets that are essential for training and fine-tuning the models:
- RAVDESS: A dataset featuring 24 speakers and eight emotions (neutral, calm, happy, sad, angry, fearful, disgusted, and surprised); see the labeling sketch after this list.
- EMOV-DB: Includes five speakers covering five emotions, with non-verbal vocal cues such as laughs and yawns embedded in the recordings.
- LJ Speech: A large single-speaker corpus of neutral recordings that provides broad lexical coverage but carries no emotion labels.
- IEMOCAP: Features a wide variety of scripted and improvised utterances with rich emotional annotations.
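As a concrete illustration of how such corpora are prepared, the sketch below maps RAVDESS clips to emotion labels using the dataset's filename convention, in which the third dash-separated field encodes the emotion. The directory path and helper name are hypothetical, for illustration only; they are not part of the project code.

```python
from pathlib import Path

# RAVDESS filenames encode metadata in seven dash-separated fields,
# e.g. "03-01-06-01-02-01-12.wav"; the third field is the emotion code.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def label_ravdess_clips(data_dir):
    """Map each .wav file under data_dir to its emotion label.

    Hypothetical helper for illustration; not from the project repo.
    """
    labels = {}
    for wav in sorted(Path(data_dir).rglob("*.wav")):
        code = wav.stem.split("-")[2]
        labels[str(wav)] = EMOTIONS.get(code, "unknown")
    return labels

# e.g. labels = label_ravdess_clips("data/RAVDESS")
```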
Relevant Literature
The project builds on seminal work in neural TTS, including the Tacotron family of models and style-token approaches for controllable emotional expressiveness. Key literature also covers DCTTS, a convolutional architecture trained with a guided attention loss for efficient, well-aligned synthesis.
Approach: Tacotron Models
Several approaches were taken to fine-tune Tacotron models for emotional speech generation:
- Approach 1: A basic Tacotron model, pre-trained on neutral speech, was fine-tuned directly on an emotional dataset; this exposed catastrophic forgetting, with the model losing much of its pre-trained knowledge.
- Approaches 2-3: Lowering the learning rate and switching to optimizers such as SGD aimed to improve retention, but brought no significant improvement.
- Approaches 4-5: Freezing parts of the network to preserve pre-learned knowledge was explored next, again with mixed results.
- Approach 6: Freezing just the post-net while fine-tuning on EMOV-DB finally produced intelligible speech that carried emotion (see the sketch after this list).
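The freezing strategy can be sketched in a few lines of PyTorch. The snippet below freezes only the post-net (Approach 6) and builds an SGD optimizer over the remaining parameters (the optimizer switch from Approaches 2-3); it assumes a Tacotron implementation whose post-net parameter names start with "postnet", which is common in open-source ports but is an assumption here, not the project's exact code.

```python
import torch

def prepare_for_finetuning(model, lr=1e-4, freeze_prefix="postnet"):
    """Freeze the post-net and return an SGD optimizer over the rest.

    Assumes post-net parameter names begin with `freeze_prefix`, as in
    common PyTorch Tacotron ports; adjust the prefix for other codebases.
    """
    for name, param in model.named_parameters():
        # Only parameters outside the frozen prefix remain trainable.
        param.requires_grad = not name.startswith(freeze_prefix)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=0.9)
```

Freezing a module in this way preserves its pre-trained weights exactly, which is what makes it a guard against the forgetting observed in the earlier approaches.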
Approach: DCTTS Models
DCTTS models were also evaluated:
- Approach 7: Initial fine-tuning of DCTTS models on EMOV-DB struggled to produce meaningful output.
- Approach 8: Refining the speaker selection and the attention mechanism led to more coherent emotional TTS output (a guided-attention sketch follows this list).
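The attention refinement above connects to the guided attention loss from the DCTTS literature cited earlier. The sketch below reproduces that penalty in PyTorch as an illustration, assuming the model exposes its attention matrix with shape (batch, text_len, mel_len); it is not the project's exact training code.

```python
import torch

def guided_attention_loss(attn, g=0.2):
    """Guided attention penalty from the DCTTS paper (Tachibana et al.).

    attn: (batch, text_len, mel_len) attention weights from Text2Mel.
    Weights far from the diagonal are penalized, nudging the model
    toward the roughly monotonic text-to-audio alignment TTS requires.
    """
    _, n_text, n_mel = attn.shape
    text_pos = torch.arange(n_text, dtype=torch.float32,
                            device=attn.device) / n_text
    mel_pos = torch.arange(n_mel, dtype=torch.float32,
                           device=attn.device) / n_mel
    # W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 * g^2))
    w = 1.0 - torch.exp(
        -((text_pos[:, None] - mel_pos[None, :]) ** 2) / (2.0 * g ** 2)
    )
    return (attn * w.unsqueeze(0)).mean()
```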
Reproducibility and Code
The project emphasizes transparency and reproducibility, providing access to code and datasets for verifying results and further exploration by the community.
Demonstration
An interactive demonstration of the models is made available, allowing users to experience the emotional TTS capabilities developed through the project.
Conclusion
The "Deep Learning for Emotional Text-to-Speech" project represents a significant step forward in making TTS systems more human-like by imparting emotional context. Despite challenges and learning curves, the methodologies explored have paved the way for more natural and expressive speech synthesis solutions.