FastSpeech2: Fast and High-Quality Text-to-Speech
FastSpeech2 is an unofficial PyTorch implementation of the FastSpeech 2 model, designed for fast, high-quality end-to-end text-to-speech. This implementation builds upon the FastSpeech code from ESPnet with modifications for better results. It uses NVIDIA's Tacotron 2 preprocessing for audio and MelGAN as the vocoder, which converts the generated mel-spectrograms into waveforms.
Project Overview
FastSpeech2 aims to provide an efficient, high-quality text-to-speech solution. Because the model generates speech non-autoregressively, it emphasizes synthesis speed without sacrificing audio quality, improving over earlier text-to-speech models.
Demonstration
A live demonstration of FastSpeech2 can be accessed through the "Open in Colab" link, which lets users interact with the model and observe its performance in real time.
Requirements
- Python 3.6.2 is required for running the code.
- PyTorch: Ensure the CUDA version matching your system is installed before installing PyTorch. The project uses PyTorch 1.6.0, which introduces the torch.bucketize function (see the sketch after this list).
pip install torch torchvision
- Additional Requirements: Install all other dependencies with the provided requirements file.
pip install -r requirements.txt
- TensorBoard: For monitoring training and validation, install TensorBoard 1.14.0 separately, along with a compatible TensorFlow version (1.14.0).
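For context on why torch.bucketize matters: FastSpeech 2-style models typically quantize continuous pitch and energy values into a fixed number of bins before looking up embeddings in the variance adaptor. The snippet below is a minimal sketch of that idea only; the bin count, value ranges, and variable names are assumptions for illustration, not values taken from this repository's code.
import torch

# Hypothetical pitch range and bin count; the real values come from the
# dataset statistics and the config, not from this sketch.
F0_MIN, F0_MAX, N_BINS = 71.0, 795.8, 256

# Evenly spaced bin boundaries between the dataset's min and max F0.
pitch_bins = torch.linspace(F0_MIN, F0_MAX, N_BINS - 1)

# A frame-level pitch contour (one value per mel frame), batch of 1.
pitch = torch.tensor([[120.0, 133.5, 0.0, 250.2]])

# torch.bucketize (PyTorch >= 1.6) maps each value to a bin index,
# which can then be fed to an nn.Embedding in the variance adaptor.
pitch_ids = torch.bucketize(pitch, pitch_bins)
pitch_embedding = torch.nn.Embedding(N_BINS, 256)(pitch_ids)
print(pitch_ids.shape, pitch_embedding.shape)  # (1, 4) and (1, 4, 256)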
Preprocessing
The project uses preprocessed LJSpeech dataset files for duration extraction, so text and audio do not need to be aligned manually. For datasets other than LJSpeech, alignment instructions are provided. The following commands preprocess the audio files:
For general audio preprocessing:
python .\nvidia_preprocessing.py -d path_of_wavs
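For reference, Tacotron 2-style preprocessing converts each wav into an 80-band log-mel spectrogram. The snippet below is only a rough librosa-based approximation of that step under typical Tacotron 2 hyperparameters (22050 Hz, 1024-point FFT, 256-sample hop); it is not the exact NVIDIA STFT code used by nvidia_preprocessing.py.
import librosa
import numpy as np

# Typical Tacotron 2 hyperparameters (assumed here, not read from the repo's config).
SR, N_FFT, HOP, WIN, N_MELS, FMIN, FMAX = 22050, 1024, 256, 1024, 80, 0, 8000

def wav_to_logmel(path):
    # Load and resample the waveform to the target sampling rate.
    wav, _ = librosa.load(path, sr=SR)
    # Magnitude mel spectrogram, then log compression with a small floor,
    # mirroring Tacotron 2-style dynamic range compression.
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
        n_mels=N_MELS, fmin=FMIN, fmax=FMAX, power=1.0)
    return np.log(np.clip(mel, 1e-5, None))  # shape: (80, n_frames)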
For computing the minimum and maximum values of F0 (pitch) and energy:
python .\compute_statistics.py
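The statistics step scans the preprocessed features and records the global minimum and maximum of pitch and energy, which are later used to define the quantization bins. Below is a minimal sketch of that idea; the directory layout and per-utterance .npy file format are assumptions for illustration, not the actual contents of compute_statistics.py.
import glob
import numpy as np

def min_max(pattern):
    # Aggregate a global min/max over all per-utterance .npy feature files.
    lo, hi = np.inf, -np.inf
    for path in glob.glob(pattern):
        values = np.load(path)
        values = values[values > 0]  # skip zero frames (unvoiced regions for F0)
        if values.size:
            lo, hi = min(lo, values.min()), max(hi, values.max())
    return float(lo), float(hi)

f0_min, f0_max = min_max("data/pitch/*.npy")          # hypothetical paths
energy_min, energy_max = min_max("data/energy/*.npy")
print(f0_min, f0_max, energy_min, energy_max)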
Training FastSpeech2
Training the FastSpeech2 model can be initiated with:
python train_fastspeech.py --outdir etc -c configs/default.yaml -n "name"
Performing Inference
Inference is currently supported for phoneme-based synthesis. An example command for performing inference:
python .\inference.py -c .\configs\default.yaml -p .\checkpoints\first_1\ts_version2_fastspeech_fe9a2c7_7k_steps.pyt --out output --text "Input text here."
An inference demo is also available in Colab for easier interaction.
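Since MelGAN is the vocoder, the predicted mel-spectrogram has to be converted into a waveform as a final step. The sketch below shows one way to do that, assuming the torch.hub entry point published by the seungwonpark/melgan repository and a mel tensor of shape (batch, 80, frames); the exact interface used by inference.py may differ.
import torch

# Load a pretrained MelGAN generator via torch.hub (assumed entry point
# from the seungwonpark/melgan repository).
vocoder = torch.hub.load("seungwonpark/melgan", "melgan")
vocoder.eval()

# 'mel' stands in for the mel-spectrogram predicted by FastSpeech2;
# random data is used here purely for illustration.
mel = torch.randn(1, 80, 200)

with torch.no_grad():
    audio = vocoder.inference(mel)  # 1-D waveform tensor (22050 Hz for the LJSpeech checkpoint)

print(audio.shape)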
Exporting to TorchScript
For exporting the model to TorchScript, use the command:
python export_torchscript.py -c configs/default.yaml -n fastspeech_script --outdir etc
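Once exported, the scripted model can be loaded without the original Python class definitions. Below is a minimal sketch of loading and calling it; the output file path, the expected input (a batch of phoneme IDs), and the output structure are assumptions about this repo's export, not documented behavior.
import torch

# Path to the exported module; the actual file name and location depend on
# how export_torchscript.py writes it (hypothetical here).
model = torch.jit.load("etc/fastspeech_script.pt", map_location="cpu")
model.eval()

# A batch of phoneme IDs (placeholder values, not a real phoneme sequence
# from the repo's text frontend).
phonemes = torch.randint(low=1, high=70, size=(1, 32), dtype=torch.long)

with torch.no_grad():
    outputs = model(phonemes)  # typically a mel-spectrogram plus auxiliary outputs

print(type(outputs))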
Checkpoint and Sample Outputs
The project provides pre-trained model checkpoints and sample outputs for evaluation; these can be found in the linked Google Drive folder and in the project's sample folder, respectively.
Monitoring with TensorBoard
TensorBoard provides visual insight into the training process, covering both the training and validation phases.
- Training View: Visual performance metrics during the training process.
- Validation View: Graphs showing the model's performance on validation data.
Additional Notes
While this repo already produces high-quality audio, further refinements and optimizations are needed. The code was initially written to replicate the FastSpeech2 paper and is open to improvements and suggestions.
For a more comprehensive text-to-speech or voice cloning toolbox, users can visit Deepsync Technologies.
References
This project references numerous prior works, including the FastSpeech and ESPnet projects, and influential tools like NVIDIA's WaveGlow, MelGAN, and WaveRNN implementations.