Introduction to WaveRNN
WaveRNN is a PyTorch implementation of DeepMind's model for efficient neural audio synthesis. The project extends this vocoder into a Text-to-Speech (TTS) pipeline by pairing it with the vanilla Tacotron (version 1) model, with more comprehensive capabilities being added over time.
Installation
To get started with WaveRNN, you need Python 3.6 or above and PyTorch 1.x with CUDA support. Once these prerequisites are met, the remaining dependencies can be installed via pip:
pip install -r requirements.txt
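As a quick sanity check of the prerequisites (this snippet is illustrative, not part of the repo), you can verify the Python and CUDA setup from a Python shell:
import sys

import torch

assert sys.version_info >= (3, 6), 'Python 3.6+ is required'
print('PyTorch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())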
How to Use
Quick Start
WaveRNN offers a straightforward starting point for those eager to explore its TTS functionality. Run:
python quick_start.py
This processes the sentences listed in the default sentences.txt file and saves the results in a 'quick_start' folder. Users can listen to the output wave files and examine the attention plots to better understand the model's performance. To experiment with custom sentences or obtain better audio quality, pass options on the command line instead:
python quick_start.py -u --input_text "What will happen if I run this command?"
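For reference, sentences.txt is assumed here to be a plain-text file with one sentence per line; the lines below are illustrative examples, not the repo's defaults:
The quick brown fox jumps over the lazy dog.
Neural vocoders can synthesize speech one sample at a time.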
Training Your Own Models
To train models within WaveRNN, first download the LJSpeech dataset. After editing the hparams.py file so that wav_path points to the dataset, run preprocessing with:
python preprocess.py
or specify the data path directly:
python preprocess.py --path <path_to_dataset>
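As an illustration of the hparams.py edit described above (the path is a placeholder for wherever LJSpeech was extracted):
# hparams.py -- point wav_path at the extracted LJSpeech wavs (placeholder path)
wav_path = '/data/LJSpeech-1.1/wavs/'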
Here's the recommended order for training:
- Train Tacotron by running:
python train_tacotron.py
- Optionally, you may create a GTA (ground-truth-aligned) dataset during or after Tacotron training (see the sketch after this list) with:
python train_tacotron.py --force_gta
- Train WaveRNN using the GTA dataset with:
python train_wavernn.py --gta
If TTS functionalities aren't a focus, it is feasible to train without the GTA dataset.
- Generate sentences using both models with:
python gen_tacotron.py wavernn
For customizing output sentences:
python gen_tacotron.py --input_text "this is whatever you want it to be" wavernn
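For readers unfamiliar with the term, a GTA (ground-truth-aligned) dataset consists of mel spectrograms predicted by Tacotron under teacher forcing, so WaveRNN learns to invert the same kind of mels that Tacotron produces at synthesis time. A minimal sketch of the idea, using a hypothetical forward signature rather than this repo's actual code:
import torch

@torch.no_grad()
def make_gta_mels(tacotron, text_ids, target_mel):
    # Teacher forcing: the decoder is fed the ground-truth mel frames, so its
    # predictions stay time-aligned with the original audio.
    gta_mel = tacotron(text_ids, target_mel)  # hypothetical signature
    # The predicted mels are saved to disk and paired with the raw waveforms
    # to form WaveRNN's training set.
    return gta_mel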
Help for additional options can be accessed using the --help flag with any script. For example:
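python train_tacotron.py --help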
Samples
Audio samples generated using this project can be found here.
Pretrained Models
WaveRNN provides two pretrained models within the /pretrained/ directory:
- WaveRNN: Trained with Mixture of Logistics output for 800,000 steps.
- Tacotron: Trained for 180,000 steps.
Both models were trained on the LJSpeech dataset.
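A minimal loading sketch, assuming standard PyTorch checkpoint files; the filename below is a placeholder, not the actual name shipped in /pretrained/:
import torch

checkpoint_path = 'pretrained/wavernn_mol_800k.pyt'  # placeholder filename
state = torch.load(checkpoint_path, map_location='cpu')
# Depending on how a checkpoint was saved, 'state' is typically a state_dict
# to pass to model.load_state_dict().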
References and Acknowledgements
WaveRNN builds on influential research in neural audio synthesis. Notable references include:
- Efficient Neural Audio Synthesis by DeepMind
- Tacotron: Towards End-to-End Speech Synthesis
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Acknowledgments go to various GitHub contributors and resources like Keith Ito's Tacotron and r9y9’s Wavenet Vocoder. Special thanks are extended to G-Wang, geneing, and erogol for their contributions.