Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Overview
The Speech Resynthesis project provides a method for resynthesizing speech from self-supervised discrete representations. The method relies on disentangled representations of speech content, prosodic information, and speaker identity, which makes it possible to synthesize speech in a controllable manner. The project evaluates several state-of-the-art self-supervised learning methods, focusing on reconstruction quality and on how well the different elements of speech are separated.
Key Features
- Disentangled Representation: Speech is encoded into separate discrete streams for content, prosody (F0), and speaker identity, which makes the resulting synthesis more flexible and controllable (see the sketch after this list).
- Speech Quality: The researchers assess speech quality with several benchmarks, including F0 reconstruction error (a measure of pitch accuracy), speaker identification performance, and subjective human evaluations of intelligibility and overall audio quality.
- Lightweight Speech Codec: The same representations support an ultra-lightweight speech codec that runs at a bit rate as low as 365 bits per second while maintaining better quality than the baseline methods.
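To make the disentangled conditioning concrete, here is a minimal PyTorch sketch of how the three discrete streams might be combined into per-frame decoder features. The codebook sizes, embedding dimensions, and module names are illustrative assumptions, not the repository's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real model's codebooks and dimensions differ.
NUM_CONTENT_UNITS = 100   # e.g. discrete units from a HuBERT/CPC/VQ-VAE encoder
NUM_F0_BINS = 32          # quantized pitch codes
NUM_SPEAKERS = 109        # e.g. a VCTK-sized speaker inventory
EMB_DIM = 128

class ConditioningStack(nn.Module):
    """Embeds content, F0, and speaker codes and concatenates them per frame."""

    def __init__(self):
        super().__init__()
        self.content_emb = nn.Embedding(NUM_CONTENT_UNITS, EMB_DIM)
        self.f0_emb = nn.Embedding(NUM_F0_BINS, EMB_DIM)
        self.speaker_emb = nn.Embedding(NUM_SPEAKERS, EMB_DIM)

    def forward(self, content, f0, speaker):
        # content, f0: (batch, frames); speaker: (batch,)
        spk = self.speaker_emb(speaker).unsqueeze(1).expand(-1, content.size(1), -1)
        return torch.cat([self.content_emb(content), self.f0_emb(f0), spk], dim=-1)

stack = ConditioningStack()
content = torch.randint(0, NUM_CONTENT_UNITS, (1, 50))  # 50 frames of content units
f0 = torch.randint(0, NUM_F0_BINS, (1, 50))             # matching quantized F0 codes
speaker = torch.tensor([7])                             # target speaker id
features = stack(content, f0, speaker)                  # shape (1, 50, 3 * EMB_DIM)
print(features.shape)
```

Because the speaker code enters as an independent stream, swapping the speaker id while keeping the content and F0 codes fixed is what makes controllable, voice-converted resynthesis possible.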
Setup
Software Requirements
To set up the environment for this project, users need:
- Python version 3.6 or higher
- PyTorch version 1.8
Users should clone the project repository and install the necessary dependencies via the provided commands.
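As a quick, optional sanity check of the requirements above (it is not part of the repository), one might run:

```python
import sys

import torch

# Verify the interpreter and PyTorch versions stated above (Python >= 3.6, PyTorch 1.8).
assert sys.version_info >= (3, 6), "Python 3.6 or higher is required"
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```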
Data Preparation
- LJSpeech Dataset: Download the LJSpeech dataset and downsample it from 22.05 kHz to 16 kHz (the snippet after this list sketches the resampling step).
- VCTK Dataset: Download the VCTK dataset and downsample it from 48 kHz to 16 kHz, then trim silences and pad the audio.
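The repository documents its own preparation steps; purely as an illustration of the resampling itself, the sketch below converts a single wav file to 16 kHz with torchaudio. The paths are placeholders, and the VCTK silence trimming and padding are omitted.

```python
from pathlib import Path

import torchaudio
import torchaudio.transforms as T

TARGET_SR = 16_000

def resample_to_16k(in_path: str, out_path: str) -> None:
    """Load a wav at its native rate (22.05 kHz for LJSpeech, 48 kHz for VCTK)
    and write a 16 kHz copy."""
    waveform, sr = torchaudio.load(in_path)
    if sr != TARGET_SR:
        waveform = T.Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(out_path, waveform, TARGET_SR)

# Placeholder paths; in practice the function is applied to every file in the corpus.
resample_to_16k("LJSpeech-1.1/wavs/LJ001-0001.wav", "LJSpeech-1.1-16k/LJ001-0001.wav")
```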
Training
The training process involves different models for different tasks:
- F0 Quantizer Model: Trained to quantize pitch (F0) information; training is launched with a command that specifies a checkpoint path and a configuration file (a conceptual sketch of pitch quantization appears below).
- Resynthesis Model: Trained with similar steps, and reconstructs the audio from the discrete representations.
Training configurations support various combinations of datasets and self-supervised learning methods such as HuBERT, CPC, and VQVAE.
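As a rough intuition for what the F0 quantizer provides, the sketch below snaps a pitch contour onto a small codebook, producing one code per frame. It is only a conceptual illustration with a hand-picked codebook; the actual model learns its codebook during the training run described above.

```python
import torch

def quantize_f0(f0: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each F0 value (Hz) to the index of its nearest codebook entry.
    Unvoiced frames (f0 == 0) are given the sentinel index -1 for clarity."""
    idx = torch.argmin((f0.unsqueeze(-1) - codebook).abs(), dim=-1)
    return torch.where(f0 > 0, idx, torch.full_like(idx, -1))

# Hand-picked codebook spanning a typical speech F0 range (Hz);
# the real quantizer learns its codebook from data.
codebook = torch.linspace(80.0, 300.0, steps=32)

# Synthetic contour standing in for an extracted pitch track (0.0 marks unvoiced frames).
f0 = torch.tensor([0.0, 110.0, 115.0, 120.0, 0.0, 210.0, 220.0])
codes = quantize_f0(f0, codebook)
print(codes)                          # one discrete pitch code per frame
print(codebook[codes.clamp(min=0)])   # dequantized F0 (unvoiced frames clamp to the first entry here)
```

The resynthesis model then consumes such discrete pitch codes together with the content units and a speaker code, rather than raw acoustic features.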
Inference
Once the models are trained, users can generate synthesized speech through the inference scripts. The system supports several synthesis tasks, including multi-speaker synthesis and resynthesizing speech inputs drawn from different datasets.
Preprocessing New Datasets
The project guides users through preprocessing new datasets with the CPC, HuBERT, and VQVAE coding techniques. The extracted discrete codes are then parsed into the input format the training and inference scripts expect, allowing the models to work with a wide variety of speech data.
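The exact on-disk format produced by these coding steps is defined by the repository's scripts; purely as an illustration of the parsing idea, the sketch below reads a hypothetical manifest in which each line pairs an audio path with a space-separated sequence of unit ids.

```python
from pathlib import Path
from typing import Dict, List

def parse_units_manifest(path: str) -> Dict[str, List[int]]:
    """Parse a hypothetical manifest where each line reads
    <audio_path>|<unit unit unit ...> and return a mapping from
    audio path to its discrete unit sequence."""
    units = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        audio_path, code_str = line.split("|", maxsplit=1)
        units[audio_path] = [int(c) for c in code_str.split()]
    return units

# Tiny self-contained demo; the line format is an assumption, not the repository's spec.
Path("units_demo.txt").write_text("wavs/p225_001.wav|71 71 12 12 40 40 5\n")
print(parse_units_manifest("units_demo.txt"))
```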
Licensing and Acknowledgments
The codebase is released under the license described in the repository's license file, which details usage rights. The implementation builds on ideas and code from other projects, such as HiFi-GAN and Jukebox.
Citation
Users who reference this project in academic work can cite it using the citation format provided in the repository.
The Speech Resynthesis project offers a robust framework for researchers and developers interested in advanced speech synthesis techniques, combining self-supervised learning with innovative representation disentanglement to enhance speech synthesis capabilities.