Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Overview
The Speech Resynthesis project provides a method for resynthesizing speech from self-supervised discrete representations. The method relies on disentangled representations of speech content, prosodic information, and speaker identity, which makes it possible to synthesize speech in a controllable manner. The project evaluates several state-of-the-art self-supervised learning methods, focusing on reconstruction quality and on how well the different elements of speech are separated.
Key Features
- Disentangled Representation: Speech is encoded into separate discrete streams for content, prosody (F0), and speaker identity, which makes the resulting synthesis more flexible and controllable (see the sketch after this list).
- Speech Quality: The researchers assess speech quality with several benchmarks, including F0 reconstruction error (a measure of pitch accuracy), speaker identification performance, and subjective human evaluations of intelligibility and overall audio quality.
- Lightweight Speech Codec: The same representations support an ultra-lightweight speech codec that runs at a bit rate as low as 365 bits per second while maintaining better quality than the baseline methods.
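To make the disentangled conditioning concrete, here is a minimal PyTorch sketch of how the three discrete streams might be combined into per-frame decoder features. The codebook sizes, embedding dimensions, and module names are illustrative assumptions, not the repository's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real model's codebooks and dimensions differ.
NUM_CONTENT_UNITS = 100   # e.g. discrete units from a HuBERT/CPC/VQ-VAE encoder
NUM_F0_BINS = 32          # quantized pitch codes
NUM_SPEAKERS = 109        # e.g. a VCTK-sized speaker inventory
EMB_DIM = 128

class ConditioningStack(nn.Module):
    """Embeds content, F0, and speaker codes and concatenates them per frame."""

    def __init__(self):
        super().__init__()
        self.content_emb = nn.Embedding(NUM_CONTENT_UNITS, EMB_DIM)
        self.f0_emb = nn.Embedding(NUM_F0_BINS, EMB_DIM)
        self.speaker_emb = nn.Embedding(NUM_SPEAKERS, EMB_DIM)

    def forward(self, content, f0, speaker):
        # content, f0: (batch, frames); speaker: (batch,)
        spk = self.speaker_emb(speaker).unsqueeze(1).expand(-1, content.size(1), -1)
        return torch.cat([self.content_emb(content), self.f0_emb(f0), spk], dim=-1)

stack = ConditioningStack()
content = torch.randint(0, NUM_CONTENT_UNITS, (1, 50))  # 50 frames of content units
f0 = torch.randint(0, NUM_F0_BINS, (1, 50))             # matching quantized F0 codes
speaker = torch.tensor([7])                             # target speaker id
features = stack(content, f0, speaker)                  # shape (1, 50, 3 * EMB_DIM)
print(features.shape)
```

Because the speaker code enters as an independent stream, swapping the speaker id while keeping the content and F0 codes fixed is what makes controllable, voice-converted resynthesis possible.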
Setup
Software Requirements
To set up the environment for this project, users need:
- Python version 3.6 or higher
- PyTorch version 1.8
Users should clone the project repository and install the necessary dependencies via the provided commands.
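As a quick, optional sanity check of the requirements above (it is not part of the repository), one might run:

```python
import sys

import torch

# Verify the interpreter and PyTorch versions stated above (Python >= 3.6, PyTorch 1.8).
assert sys.version_info >= (3, 6), "Python 3.6 or higher is required"
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```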
Data Preparation
- LJSpeech Dataset: Download the LJSpeech dataset and downsample it from 22.05 kHz to 16 kHz (the snippet after this list sketches the resampling step).
- VCTK Dataset: Download the VCTK dataset and downsample it from 48 kHz to 16 kHz, then trim silences and pad the audio.
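The repository documents its own preparation steps; purely as an illustration of the resampling itself, the sketch below converts a single wav file to 16 kHz with torchaudio. The paths are placeholders, and the VCTK silence trimming and padding are omitted.

```python
from pathlib import Path

import torchaudio
import torchaudio.transforms as T

TARGET_SR = 16_000

def resample_to_16k(in_path: str, out_path: str) -> None:
    """Load a wav at its native rate (22.05 kHz for LJSpeech, 48 kHz for VCTK)
    and write a 16 kHz copy."""
    waveform, sr = torchaudio.load(in_path)
    if sr != TARGET_SR:
        waveform = T.Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(out_path, waveform, TARGET_SR)

# Placeholder paths; in practice the function is applied to every file in the corpus.
resample_to_16k("LJSpeech-1.1/wavs/LJ001-0001.wav", "LJSpeech-1.1-16k/LJ001-0001.wav")
```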
Training
The training process involves different models for different tasks:
- F0 Quantizer Model: Trained to quantize pitch (F0) information; training is launched with a command that specifies a checkpoint path and a configuration file (a conceptual sketch of pitch quantization appears below).
- Resynthesis Model: Trained with similar steps, and reconstructs the audio from the discrete representations.
Training configurations support various combinations of datasets and self-supervised learning methods such as HuBERT, CPC, and VQVAE.
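As a rough intuition for what the F0 quantizer provides, the sketch below snaps a pitch contour onto a small codebook, producing one code per frame. It is only a conceptual illustration with a hand-picked codebook; the actual model learns its codebook during the training run described above.

```python
import torch

def quantize_f0(f0: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each F0 value (Hz) to the index of its nearest codebook entry.
    Unvoiced frames (f0 == 0) are given the sentinel index -1 for clarity."""
    idx = torch.argmin((f0.unsqueeze(-1) - codebook).abs(), dim=-1)
    return torch.where(f0 > 0, idx, torch.full_like(idx, -1))

# Hand-picked codebook spanning a typical speech F0 range (Hz);
# the real quantizer learns its codebook from data.
codebook = torch.linspace(80.0, 300.0, steps=32)

# Synthetic contour standing in for an extracted pitch track (0.0 marks unvoiced frames).
f0 = torch.tensor([0.0, 110.0, 115.0, 120.0, 0.0, 210.0, 220.0])
codes = quantize_f0(f0, codebook)
print(codes)                          # one discrete pitch code per frame
print(codebook[codes.clamp(min=0)])   # dequantized F0 (unvoiced frames clamp to the first entry here)
```

The resynthesis model then consumes such discrete pitch codes together with the content units and a speaker code, rather than raw acoustic features.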
Inference
Once the models are trained, users can generate synthesized speech through the inference scripts. The system supports several synthesis tasks, including multi-speaker synthesis and resynthesizing speech inputs drawn from different datasets.
Preprocessing New Datasets
The project guides users through preprocessing new datasets with the CPC, HuBERT, and VQVAE coding techniques. The extracted discrete codes are then parsed into the input format the training and inference scripts expect, allowing the models to work with a wide variety of speech data.
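The exact on-disk format produced by these coding steps is defined by the repository's scripts; purely as an illustration of the parsing idea, the sketch below reads a hypothetical manifest in which each line pairs an audio path with a space-separated sequence of unit ids.

```python
from pathlib import Path
from typing import Dict, List

def parse_units_manifest(path: str) -> Dict[str, List[int]]:
    """Parse a hypothetical manifest where each line reads
    <audio_path>|<unit unit unit ...> and return a mapping from
    audio path to its discrete unit sequence."""
    units = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        audio_path, code_str = line.split("|", maxsplit=1)
        units[audio_path] = [int(c) for c in code_str.split()]
    return units

# Tiny self-contained demo; the line format is an assumption, not the repository's spec.
Path("units_demo.txt").write_text("wavs/p225_001.wav|71 71 12 12 40 40 5\n")
print(parse_units_manifest("units_demo.txt"))
```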
Licensing and Acknowledgments
The codebase is released under the license described in the repository's license file, which details usage rights. The implementation builds on ideas and code from other projects, such as HiFi-GAN and Jukebox.
Citation
Users who reference this project in academic work can cite it using the citation format provided in the repository.
The Speech Resynthesis project offers a robust framework for researchers and developers interested in advanced speech synthesis techniques, combining self-supervised learning with innovative representation disentanglement to enhance speech synthesis capabilities.