Introduction to MegaTTS2
MegaTTS2 is an unofficial implementation of Mega-TTS 2, a text-to-speech system designed to transform text inputs into natural-sounding speech. It stands out for its zero-shot capability: it can synthesize speech in the voice of an unseen speaker from only a short reference prompt, without requiring per-speaker training data.
Project Features
The project roadmap is organized into two phases: a base test and a better version.
Base Test
The base test phase focuses on fundamental functionalities to ensure the system's basic operability. Key elements of this phase include:
- Dataset Preparation: Initial steps involve gathering and organizing audio (wav files) and the corresponding text transcripts.
- VQ-GAN: A vector-quantization-based Generative Adversarial Network used to reconstruct high-quality audio.
- ADM and PLM: Likely the auto-regressive duration model (ADM) and the prosody language model (PLM) from the Mega-TTS 2 paper, which handle duration and prosody prediction in the text-to-speech pipeline.
All tasks in this phase have been successfully completed.
Better Version
The next phase aims to enhance the existing system with several improvements:
- Enhanced Sound Quality: Replacing HiFi-GAN with BigVGAN as the vocoder is under consideration, with the goal of more natural, higher-quality audio output.
- Multilingual Training: Combining Chinese and English training data to broaden the system's linguistic coverage.
- Extended Training: Training on approximately 1,000 hours of speech to enrich the speech database and boost accuracy.
- Web Interface (WebUI): Adding a web-based user interface for easier access and operation.
Installation and Setup
For setting up the alignment environment, the Montreal Forced Aligner (MFA) is used, which involves the following steps:
- Creation of a Conda environment named "aligner."
- Installation of the MFA version 2.2.17 from the conda-forge channel.
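The two setup steps above correspond to the standard MFA installation route; a minimal sketch follows (the environment name "aligner" and MFA version 2.2.17 come from this document):

```shell
# Create a dedicated conda environment named "aligner" and install
# Montreal Forced Aligner 2.2.17 from the conda-forge channel.
conda create -n aligner -c conda-forge montreal-forced-aligner=2.2.17

# Activate the environment before running any mfa commands.
conda activate aligner

# Sanity check: print the installed MFA version.
mfa version
```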
Dataset Preparation Steps
- Collect wav and text files in the designated directory (./data/wav).
- Execute a Python script to organize and prepare the dataset for alignment.
- Download the acoustic model for Mandarin from the MFA resources.
- Align the speech data using Mandarin phonetic transcriptions.
- Execute further preparation scripts to finalize the dataset configuration.
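Assuming the layout above (./data/wav holding wav/transcript pairs), the model download and alignment steps might look like the following sketch. The mandarin_mfa model and dictionary names and the output directory follow standard MFA 2.x conventions and are assumptions, not taken from this document:

```shell
# Download the pretrained Mandarin acoustic model and its matching
# pronunciation dictionary from the MFA model repository.
mfa model download acoustic mandarin_mfa
mfa model download dictionary mandarin_mfa

# Align the corpus: reads wav/transcript pairs from ./data/wav and
# writes TextGrid alignments to ./data/aligned (path is an assumption).
mfa align ./data/wav mandarin_mfa mandarin_mfa ./data/aligned
```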
Training and Testing
Model training is built on PyTorch Lightning, which streamlines the training loop with high-level abstractions.
Inference Test
After training, an inference test can be run with the infer.py script to validate the model's ability to generate speech from text inputs.
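The inference test reduces to a single command (infer.py is the script named in this document; it is invoked here without arguments, since none are specified):

```shell
# Run inference with the trained model from the project root,
# inside the environment used for training.
python infer.py
```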
Citation
The creators of this project have provided a citation format for referencing in academic and professional contexts, highlighting contributions from multiple authors led by Ziyue Jiang.
Licensing and Support
The project is released under the MIT License, giving open access to the source code, and is supported by Simon from ZideAI.
MegaTTS2 represents a meaningful step in the field of text-to-speech, promising enhanced speech synthesis with ongoing improvements focused on quality and versatility.