Introducing FACodec: A Speech Codec for NaturalSpeech 3
FACodec is the speech codec at the core of NaturalSpeech 3, a text-to-speech (TTS) model. It factorizes a speech waveform into separate subspaces that represent content, prosody, timbre, and acoustic details, and it can reconstruct high-quality speech from those attributes. Handling each attribute separately makes the overall speech representation simpler to model.
Project Background
FACodec comes from a project originally hosted at Amphion/models/codec/ns3_codec and is offered under the Amphion License. It can serve as the codec for different kinds of TTS models, including non-autoregressive discrete diffusion models (NaturalSpeech 3) and autoregressive models such as VALL-E.
Installation and Setup
To use FACodec, clone the project repository and install its dependencies with pip, including torch and torchaudio. Pre-trained FACodec models are also available for download on Hugging Face.
git clone https://github.com/lifeiteng/naturalspeech3_facodec.git
cd naturalspeech3_facodec
pip3 install torch==2.1.2 torchaudio==2.1.2
pip3 install .
# pip3 install -e . # for development mode
How FACodec Works
FACodec decomposes speech into subspaces for content, prosody, timbre, and acoustic details, so each attribute can be modeled on its own. Researchers can build different TTS models on top of these representations, using either autoregressive or non-autoregressive generation.
Practical Usage
With a few lines of Python, users can load the pre-trained models and run encoding and decoding. The encoder and decoder turn raw audio into separate codes for each speech attribute and reconstruct speech from those codes. This separation is useful for research on voice conversion and other speech tasks.
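As a rough illustration of what "separately encoded" means in practice, the sketch below splits a stack of per-quantizer code streams into prosody, content, and acoustic-detail groups. The group sizes (1, 2, 3) match the vq_num_q_p, vq_num_q_c, and vq_num_q_r values in the decoder configuration quoted in this article. The prosody, content, residual ordering and the helper name split_codes are assumptions made for illustration, so check them against the ns3_codec version you install. Timbre is assumed to be carried separately (for example as a speaker embedding) and is not part of these streams.

```python
from typing import Dict, List, Sequence

# Number of quantizers per attribute, taken from the decoder settings
# quoted in this article (vq_num_q_p=1, vq_num_q_c=2, vq_num_q_r=3).
NUM_PROSODY = 1
NUM_CONTENT = 2
NUM_RESIDUAL = 3


def split_codes(code_streams: Sequence[Sequence[int]]) -> Dict[str, List[List[int]]]:
    """Split stacked code streams into per-attribute groups.

    ASSUMPTION: streams are ordered prosody -> content -> residual
    (acoustic details). This ordering is hypothetical here.
    """
    expected = NUM_PROSODY + NUM_CONTENT + NUM_RESIDUAL
    if len(code_streams) != expected:
        raise ValueError(f"expected {expected} code streams, got {len(code_streams)}")
    streams = [list(s) for s in code_streams]
    p_end = NUM_PROSODY
    c_end = p_end + NUM_CONTENT
    return {
        "prosody": streams[:p_end],
        "content": streams[p_end:c_end],
        "residual": streams[c_end:],
    }


# Toy input: 6 streams x 4 frames of made-up code indices.
toy_codes = [[q * 10 + t for t in range(4)] for q in range(6)]
groups = split_codes(toy_codes)
print({name: len(g) for name, g in groups.items()})
```

Keeping the groups in a dictionary makes it easy to pass only some attributes downstream, for instance only the content streams as features for another model.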
Zero-Shot Voice Conversion
FACodec also supports zero-shot voice conversion through FACodecEncoderV2 and FACodecDecoderV2, or through FACodecRedecoder. It can convert the voice in an input recording to that of a different speaker without having been trained on that speaker's voice.
from ns3_codec import FACodecEncoderV2, FACodecDecoderV2

fa_encoder_v2 = FACodecEncoderV2(
    ngf=32,
    up_ratios=[2, 4, 5, 5],
    out_channels=256,
)

fa_decoder_v2 = FACodecDecoderV2(
    in_channels=256,
    upsample_initial_channel=1024,
    ngf=32,
    up_ratios=[5, 5, 4, 2],
    vq_num_q_c=2,
    vq_num_q_p=1,
    vq_num_q_r=3,
    vq_dim=256,
    codebook_dim=8,
    codebook_size_prosody=10,
    codebook_size_content=10,
    codebook_size_residual=10,
    use_gr_x_timbre=True,
    use_gr_residual_f0=True,
    use_gr_residual_phone=True,
)
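The constructor arguments above also determine the codec's time resolution. The following back-of-the-envelope sketch derives the hop size from up_ratios and counts code tokens per second at 16 kHz. It uses only numbers that appear in this article. The bitrate line additionally assumes that a codebook_size_* value of 10 is a log2 size (2^10 = 1024 entries, so 10 bits per code); that interpretation is not stated here and should be checked against the implementation.

```python
from math import prod

SAMPLE_RATE = 16_000                 # Hz, per the FAQ in this article
ENCODER_UP_RATIOS = [2, 4, 5, 5]     # from FACodecEncoderV2(...)
DECODER_UP_RATIOS = [5, 5, 4, 2]     # from FACodecDecoderV2(...)
NUM_QUANTIZERS = {"prosody": 1, "content": 2, "residual": 3}

# The product of the ratios is the number of waveform samples per frame.
hop_size = prod(ENCODER_UP_RATIOS)
# The decoder ratios mirror the encoder, so they restore the same factor.
assert prod(DECODER_UP_RATIOS) == hop_size

frames_per_second = SAMPLE_RATE // hop_size
codes_per_frame = sum(NUM_QUANTIZERS.values())
codes_per_second = frames_per_second * codes_per_frame

# ASSUMPTION: codebook_size_*=10 is a log2 size (1024 entries -> 10 bits).
BITS_PER_CODE = 10
bits_per_second = codes_per_second * BITS_PER_CODE

print(hop_size, frames_per_second, codes_per_second, bits_per_second)
```

The derived hop size agrees with the 200-sample figure given in the FAQ, which is a quick sanity check when changing up_ratios.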
Frequently Asked Questions
- What audio sample rate does FACodec support?
  - FACodec is optimized for 16 kHz speech audio with a hop size of 200 samples.
- Can FACodec be used to train autoregressive TTS models?
  - Yes. Its codes can be used to train autoregressive models like VALL-E, for example with an autoregressive language model generating the prosody codes.
- Can FACodec compress and reconstruct audio from other domains?
  - While designed for speech, FACodec can be somewhat effective with other audio types, though quality may vary.
- Is FACodec useful for voice conversion?
  - Yes, its content codes can serve as features in voice conversion tasks.
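Since the first answer above fixes the input format at 16 kHz with a hop size of 200 samples, it can help to make clip lengths a whole number of frames before encoding. The helper below is a minimal sketch of that preprocessing step in plain Python. The name pad_to_hop is hypothetical, and whether FACodec actually requires such padding is an assumption rather than something stated in the source.

```python
from typing import List

SAMPLE_RATE = 16_000  # Hz
HOP_SIZE = 200        # samples per frame


def pad_to_hop(samples: List[float], hop: int = HOP_SIZE) -> List[float]:
    """Right-pad with zeros so that len(samples) is a multiple of hop."""
    remainder = len(samples) % hop
    if remainder == 0:
        return list(samples)
    return list(samples) + [0.0] * (hop - remainder)


# Toy input: 1.01 seconds of silence at 16 kHz (16160 samples).
clip = [0.0] * 16_160
padded = pad_to_hop(clip)
num_frames = len(padded) // HOP_SIZE
print(len(padded), num_frames)
```

In a real pipeline the same padding would typically be applied to a tensor or array after resampling the audio to 16 kHz.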
Conclusion
FACodec gives researchers and developers a practical tool for separating and processing individual speech attributes. Because it supports high-quality reconstruction, zero-shot voice conversion, and both autoregressive and non-autoregressive TTS models, it is a useful foundation for further work on speech generation.