Introduction to WaveRNN
WaveRNN is a PyTorch implementation of DeepMind's model for efficient neural audio synthesis. The project extends this vocoder into a Text-to-Speech (TTS) pipeline by pairing it with the vanilla Tacotron (version 1) model, with more comprehensive capabilities being added over time.
Installation
To get started with WaveRNN, you need Python 3.6 or above and PyTorch 1.x with CUDA support. Once these prerequisites are met, the remaining dependencies can be installed via pip:
pip install -r requirements.txt
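As a quick sanity check of the prerequisites (this snippet is illustrative, not part of the repo), you can verify the Python and CUDA setup from a Python shell:
import sys

import torch

assert sys.version_info >= (3, 6), 'Python 3.6+ is required'
print('PyTorch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())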
How to Use
Quick Start
WaveRNN offers a straightforward starting point for those eager to explore its TTS functionality. Run:
python quick_start.py
This processes the sentences listed in the default sentences.txt file and saves the results in a 'quick_start' folder. Users can listen to the output wave files and examine the attention plots to better understand the model's performance. To experiment with custom sentences or obtain better audio quality, pass options on the command line instead:
python quick_start.py -u --input_text "What will happen if I run this command?"
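For reference, sentences.txt is assumed here to be a plain-text file with one sentence per line; the lines below are illustrative examples, not the repo's defaults:
The quick brown fox jumps over the lazy dog.
Neural vocoders can synthesize speech one sample at a time.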
Training Your Own Models
To train models within WaveRNN, first download the LJSpeech dataset. After editing the hparams.py file so that wav_path points to the dataset, run preprocessing with:
python preprocess.py
or specify the data path directly:
python preprocess.py --path <path_to_dataset>
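As an illustration of the hparams.py edit described above (the path is a placeholder for wherever LJSpeech was extracted):
# hparams.py -- point wav_path at the extracted LJSpeech wavs (placeholder path)
wav_path = '/data/LJSpeech-1.1/wavs/'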
Here's the recommended order for training:
- Train Tacotron by running:
python train_tacotron.py
- Optionally, you may create a GTA (ground-truth-aligned) dataset during or after Tacotron training (see the sketch after this list) with:
python train_tacotron.py --force_gta
- Train WaveRNN using the GTA dataset with:
python train_wavernn.py --gta
If TTS functionalities aren't a focus, it is feasible to train without the GTA dataset.
- Generate sentences using both models with:
python gen_tacotron.py wavernn
For customizing output sentences:
python gen_tacotron.py --input_text "this is whatever you want it to be" wavernn
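For readers unfamiliar with the term, a GTA (ground-truth-aligned) dataset consists of mel spectrograms predicted by Tacotron under teacher forcing, so WaveRNN learns to invert the same kind of mels that Tacotron produces at synthesis time. A minimal sketch of the idea, using a hypothetical forward signature rather than this repo's actual code:
import torch

@torch.no_grad()
def make_gta_mels(tacotron, text_ids, target_mel):
    # Teacher forcing: the decoder is fed the ground-truth mel frames, so its
    # predictions stay time-aligned with the original audio.
    gta_mel = tacotron(text_ids, target_mel)  # hypothetical signature
    # The predicted mels are saved to disk and paired with the raw waveforms
    # to form WaveRNN's training set.
    return gta_mel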
Help for additional options can be accessed using the --help flag with any script. For example:
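python train_tacotron.py --help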
Samples
Audio samples generated using this project can be found here.
Pretrained Models
WaveRNN provides two pretrained models within the /pretrained/ directory:
- WaveRNN: Trained with Mixture of Logistics output for 800,000 steps.
- Tacotron: Trained for 180,000 steps.
Both models were trained on the LJSpeech dataset.
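A minimal loading sketch, assuming standard PyTorch checkpoint files; the filename below is a placeholder, not the actual name shipped in /pretrained/:
import torch

checkpoint_path = 'pretrained/wavernn_mol_800k.pyt'  # placeholder filename
state = torch.load(checkpoint_path, map_location='cpu')
# Depending on how a checkpoint was saved, 'state' is typically a state_dict
# to pass to model.load_state_dict().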
References and Acknowledgements
WaveRNN builds on influential research in neural audio synthesis. Notable references include:
- Efficient Neural Audio Synthesis by DeepMind
- Tacotron: Towards End-to-End Speech Synthesis
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Acknowledgments go to various GitHub contributors and resources like Keith Ito's Tacotron and r9y9’s Wavenet Vocoder. Special thanks are extended to G-Wang, geneing, and erogol for their contributions.