DiffSinger: An Introduction
Overview
DiffSinger is a PyTorch implementation of a singing voice synthesis model built around a mechanism known as shallow diffusion. The model transforms text input into singing voice output, borrowing components from FastSpeech2, and is geared toward generating human-like singing efficiently.
Core Features
- Naive Version of DiffSpeech: A baseline variant that introduces DiffSinger's core functionality and serves as the foundation for the more advanced variants.
- Auxiliary Decoder: A decoder borrowed from the FastSpeech2 architecture that produces an initial mel-spectrogram prediction for the diffusion model to refine.
- Shallow Diffusion Mechanism: The feature that sets DiffSinger apart. A pre-trained auxiliary decoder supplies an initial mel-spectrogram prediction, and the denoising component operates only from a designated shallow step K rather than over the full diffusion chain, which improves efficiency (see the sketch after this list).
- Multi-Speaker Training: Not yet fully realized within the project, this feature aims to enable synthesis of voices across multiple speakers, broadening the model's applicability.
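The shallow step K can be made concrete with a small sketch of what shallow diffusion looks like at inference time under a standard DDPM formulation. Everything here (the denoiser callable that predicts the added noise, the schedule tensors, the function name) is illustrative rather than DiffSinger's actual code.

```python
import torch

def shallow_diffusion_inference(denoiser, aux_mel, alphas, alphas_cumprod, K):
    """Start the reverse diffusion at a shallow step K instead of the final step T.

    denoiser(x, t) is assumed to predict the noise added at step t (standard DDPM);
    aux_mel is the mel-spectrogram from the pre-trained auxiliary decoder;
    alphas / alphas_cumprod are length-T tensors from the noise schedule.
    """
    # 1) Forward-diffuse the auxiliary prediction to step K: sample q(x_K | x_0).
    noise = torch.randn_like(aux_mel)
    a_bar_K = alphas_cumprod[K - 1]
    x = a_bar_K.sqrt() * aux_mel + (1 - a_bar_K).sqrt() * noise

    # 2) Reverse process over only K steps (K << T), which is the "shallow" part.
    for t in range(K, 0, -1):
        a_t, a_bar_t = alphas[t - 1], alphas_cumprod[t - 1]
        eps = denoiser(x, t)                                        # predicted noise
        mean = (x - (1 - a_t) / (1 - a_bar_t).sqrt() * eps) / a_t.sqrt()
        if t > 1:
            x = mean + (1 - a_t).sqrt() * torch.randn_like(x)       # sigma_t = sqrt(beta_t)
        else:
            x = mean
    return x
```

The saving comes from step 2: instead of denoising from pure noise over the full chain of T steps, only K reverse steps are needed because the auxiliary decoder already provides a reasonable starting mel-spectrogram.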
Getting Started
To run DiffSinger, users need to install Python dependencies using a standard package installer and download the pre-trained models for initial synthesis.
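As a loose illustration of the "download the pre-trained models" step, the snippet below simply opens a checkpoint with PyTorch and lists what it contains; the path and file name are placeholders, and the real checkpoint layout may differ.

```python
import torch

# Placeholder path; point this at wherever the downloaded checkpoint actually lives.
CKPT_PATH = "pretrained/DiffSinger_naive.pth.tar"

checkpoint = torch.load(CKPT_PATH, map_location="cpu")
print(list(checkpoint.keys()))  # typically model weights plus optimizer/step metadata
```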
Inference
- Single Inference: A single piece of text can be synthesized into singing by passing the text along with the model and dataset parameters.
- Batch Inference: Multiple texts can be processed in one run, which is convenient for larger sets of inputs (a rough sketch of both modes follows this list).
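The repository defines its own command-line entry points and flags, which are not reproduced here; the sketch below only shows the general shape of single versus batch inference around a hypothetical synthesize_one helper and a placeholder source file.

```python
from typing import Iterable

def synthesize_one(text: str) -> None:
    """Hypothetical stand-in for the project's synthesis entry point:
    text -> phonemes -> acoustic model -> mel-spectrogram -> vocoder -> audio."""
    print(f"[sketch] would synthesize: {text!r}")

def synthesize_batch(lines: Iterable[str]) -> None:
    """Batch mode: run the same pipeline over many utterances."""
    for line in lines:
        if line.strip():
            synthesize_one(line.strip())

if __name__ == "__main__":
    # Single inference: one utterance.
    synthesize_one("The quick brown fox jumps over the lazy dog.")
    # Batch inference: a prepared source file with one utterance per line (placeholder path).
    with open("val.txt", encoding="utf-8") as f:
        synthesize_batch(f)
```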
Controllability
Users can alter pitch, volume, and speaking rate, allowing for greater customization of the generated singing, though these features originate from FastSpeech2 and are not the main focus of DiffSinger.
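These knobs correspond to FastSpeech2-style variance predictions, and control is typically exercised by scaling the predicted values. The sketch below shows how such control ratios are usually applied; the function and argument names are illustrative, not necessarily the options exposed by this repository.

```python
import torch

def apply_controls(pitch, energy, durations,
                   pitch_control=1.0, energy_control=1.0, duration_control=1.0):
    """Scale predicted prosody features, FastSpeech2-style.

    pitch, energy: per-frame (or per-phoneme) predictions
    durations:     per-phoneme durations in frames
    A duration_control below 1.0 shortens every phoneme, i.e. faster output.
    """
    pitch = pitch * pitch_control
    energy = energy * energy_control
    durations = torch.clamp((durations.float() * duration_control).round().long(), min=1)
    return pitch, energy, durations

# Example: slightly higher pitch, 20% faster
p, e, d = apply_controls(torch.randn(50), torch.rand(50), torch.randint(1, 10, (20,)),
                         pitch_control=1.1, duration_control=0.8)
```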
Training
The training section outlines how to train the different model versions: naive, auxiliary, and shallow. Each variant has its own steps, from preparing the dataset to aligning phonemes with the audio using alignment tools such as the Montreal Forced Aligner, which ensures that the phoneme sequences line up accurately with the audio inputs.
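Part of the alignment step is converting Montreal Forced Aligner output (phoneme intervals in seconds) into per-phoneme durations measured in mel frames. Below is a small sketch of that conversion, assuming a typical 22,050 Hz sampling rate and 256-sample hop length rather than the repository's exact preprocessing configuration.

```python
def intervals_to_durations(intervals, sampling_rate=22050, hop_length=256):
    """Convert MFA-style (phone, start_sec, end_sec) intervals into
    per-phoneme durations in mel-spectrogram frames."""
    phones, durations = [], []
    for phone, start, end in intervals:
        start_frame = int(round(start * sampling_rate / hop_length))
        end_frame = int(round(end * sampling_rate / hop_length))
        phones.append(phone)
        durations.append(max(end_frame - start_frame, 0))
    return phones, durations

# Example: three phonemes aligned by MFA
phones, durs = intervals_to_durations(
    [("HH", 0.00, 0.08), ("AH0", 0.08, 0.15), ("L", 0.15, 0.31)])
print(phones, durs)  # ['HH', 'AH0', 'L'] [7, 6, 14]
```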
Visualization with TensorBoard
For users interested in monitoring model training, TensorBoard is available to visualize loss curves, audio spectrograms, and other training metrics. This tool is crucial for understanding the model's progression and performance over time.
Visual Insights
- Loss Curves: These graphs help users see how the model is learning over time.
- Spectrograms and Audio: Provide visual and auditory feedback on the quality of the synthesized audio.
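The items above are the kind of quantities usually written with torch.utils.tensorboard during training; a minimal sketch of that logging follows (the tags, shapes, and log directory are illustrative, not the project's exact logging code).

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="output/log/sketch")  # placeholder log directory

step = 1000
loss = 0.42                           # scalar training loss
mel = torch.rand(80, 400)             # fake mel-spectrogram [n_mels, frames]
audio = torch.rand(1, 22050) * 2 - 1  # one second of fake audio in [-1, 1]

writer.add_scalar("Loss/total", loss, step)                           # loss curves
writer.add_image("Spectrogram/synthesized", mel.unsqueeze(0), step)   # visual feedback
writer.add_audio("Audio/synthesized", audio, step, sample_rate=22050) # listenable samples
writer.close()
```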
Notes on Implementation
- The naive version has a parameter count similar to the original DiffSpeech and delivers comparable performance.
- Some limitations remain, such as how the shallow-diffusion boundary K is predicted for the current LJSpeech setup.
- The model uses HiFi-GAN for vocoding, a more efficient alternative to Parallel WaveGAN.
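As a rough picture of the vocoding stage mentioned above, the sketch below runs an already-loaded generator over a mel-spectrogram; the dummy module stands in for HiFi-GAN purely to show tensor shapes, and none of this reflects the repository's actual vocoder wrapper.

```python
import torch

def vocode(mel: torch.Tensor, generator: torch.nn.Module) -> torch.Tensor:
    """Run a (hypothetical, already-loaded) neural vocoder generator on a mel-spectrogram.

    mel: [batch, n_mels, frames] -> returns waveform [batch, samples]
    """
    generator.eval()
    with torch.no_grad():
        wav = generator(mel)  # HiFi-GAN-style generators map mels directly to raw audio
    return wav.squeeze(1).clamp(-1.0, 1.0)

# Usage with a stand-in "generator" just to show shapes (NOT HiFi-GAN):
dummy = torch.nn.Conv1d(80, 1, kernel_size=1)
fake_mel = torch.rand(1, 80, 400)
print(vocode(fake_mel, dummy).shape)  # torch.Size([1, 400]) with the placeholder module
```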
Conclusion
DiffSinger represents a meaningful step in singing voice synthesis, offering innovations such as shallow diffusion to improve synthesis efficiency. While some features, such as multi-speaker synthesis, are still under development, DiffSinger already provides a comprehensive toolkit for generating singing voices from text inputs efficiently.