Voicebox - Pytorch: A New Era in Text-to-Speech Technology
Voicebox - Pytorch is a PyTorch implementation of Voicebox, MetaAI's text-to-speech (TTS) model. The project offers a state-of-the-art generative model for producing human-like speech from text, incorporating techniques and contributions from several experts in the field.
Background and Features
MetaAI introduced Voicebox as a breakthrough TTS model that pushes the boundaries of speech generation. This implementation uses rotary embeddings in place of relative positional schemes such as ALiBi, which are less well suited to bidirectional models. It also addresses practical issues such as how the time step is embedded, borrowing techniques that worked well in Paella, and applies adaptive normalization to condition the network, improving the quality of the generated speech.
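To make the adaptive-normalization idea concrete, here is a minimal sketch of a layer whose normalization is modulated by a conditioning vector such as the flow time step. The class name AdaptiveLayerNorm and its parameters are illustrative placeholders, not the actual modules used inside voicebox-pytorch.

import torch
from torch import nn

class AdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from a conditioning vector (e.g. a time embedding)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, dim * 2)

    def forward(self, x, cond):
        # x: (batch, seq, dim), cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)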
While this Voicebox implementation is capable in its own right, the project recommends that users looking for a complete text-to-speech solution also consider alternatives such as E2 TTS Pytorch.
Contributions and Support
The development of Voicebox was supported by several key figures and organizations:
- Translated: Provided the Imminent Grant, fostering innovations in open-source TTS solutions.
- StabilityAI: Offered sponsorship that allowed independent open-source AI development.
- Bryan Chiang: Contributed through code review and shared expertise in TTS.
- Manmay and @chenht2010: Assisted in initiating and refining the repository.
- Lucas Newman: Played a crucial role in enhancing the training code for Spear-TTS and validating its effective integration with Voicebox.
Installation and Usage
Installing Voicebox is straightforward and can be done with pip:
$ pip install voicebox-pytorch
The repository provides tools for training and sampling with the TextToSemantic module from SpearTTS, which converts written text into semantic tokens that condition audio generation. Below is sample Python code demonstrating how Voicebox can be used for conditional and unconditional training and sampling:
Conditional Training and Sampling
import torch
from voicebox_pytorch import VoiceBox, EncodecVoco, ConditionalFlowMatcherWrapper, HubertWithKmeans, TextToSemantic
# Setup the text-to-semantic conversion
wav2vec = HubertWithKmeans(checkpoint_path='/path/to/hubert/checkpoint.pt', kmeans_path='/path/to/hubert/kmeans.bin')
text_to_semantic = TextToSemantic(wav2vec=wav2vec, dim=512, use_openai_tokenizer=True)
text_to_semantic.load('/path/to/trained/spear-tts/model.pt')
# Initialize Voicebox model
model = VoiceBox(dim=512, audio_enc_dec=EncodecVoco(), num_cond_tokens=500, depth=2, dim_head=64, heads=16)
# Wrap the model for conditional flow matching
cfm_wrapper = ConditionalFlowMatcherWrapper(voicebox=model, text_to_semantic=text_to_semantic)
# Training example with mock raw audio of shape (batch, num samples)
audio = torch.randn(2, 12000)
loss = cfm_wrapper(audio)
loss.backward()
# Sampling example (after training): condition on reference audio and target texts
texts = ['the rain in spain falls mainly in the plains', 'she sells sea shells by the seashore']
cond = torch.randn(2, 12000)
sampled = cfm_wrapper.sample(cond=cond, texts=texts)
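To listen to the result, the sampled tensor can be written to disk. The snippet below is a minimal sketch assuming sampled is a raw waveform of shape (batch, channels, samples) produced at 24 kHz by EncodecVoco; verify both the shape and the sample rate against your configuration.

import torchaudio

# assumed: `sampled` is (batch, channels, num samples) at 24 kHz - check against your setup
torchaudio.save('sample_0.wav', sampled[0].cpu(), sample_rate=24000)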
Unconditional Training and Sampling
import torch
from voicebox_pytorch import VoiceBox, ConditionalFlowMatcherWrapper
# Initialize Voicebox model for unconditional training
model = VoiceBox(dim=512, num_cond_tokens=500, depth=2, dim_head=64, heads=16, condition_on_text=False)
# Wrap the model for flow matching
cfm_wrapper = ConditionalFlowMatcherWrapper(voicebox=model)
# Training example with mock audio features of shape (batch, seq len, dim)
x = torch.randn(2, 1024, 512)
loss = cfm_wrapper(x)
loss.backward()
# Sampling example (after training), conditioned on audio features only
cond = torch.randn(2, 1024, 512)
sampled = cfm_wrapper.sample(cond=cond)
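For readers curious about what ConditionalFlowMatcherWrapper optimizes, the following is a schematic of a standard conditional flow matching loss rather than the repository's exact code: sample a random time, linearly interpolate between noise and data, and regress the network's predicted vector field onto the straight-line velocity. The name model_fn is a placeholder for any network that takes the noisy input and the time step.

import torch
import torch.nn.functional as F

def flow_matching_loss(model_fn, x1):
    # x1: clean data, e.g. audio features of shape (batch, seq, dim)
    batch = x1.shape[0]
    x0 = torch.randn_like(x1)                # noise sample
    t = torch.rand(batch, device=x1.device)  # random time in [0, 1]
    t_ = t.view(batch, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1             # straight-line interpolation between noise and data
    target_velocity = x1 - x0                # velocity of that straight-line path
    pred_velocity = model_fn(xt, t)          # network predicts the vector field
    return F.mse_loss(pred_velocity, target_velocity)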
Development Roadmap
The Voicebox project continues to evolve, with a clear list of completed and upcoming tasks aimed at enhancing its capabilities, including:
- Improving support for different conditioning methods
- Integrating adaptive normalization techniques
- Enhancing compatibility with neural ODE frameworks for sampling (see the sketch after this list)
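Sampling from a flow-matching model amounts to integrating the learned vector field from noise (t = 0) to data (t = 1). The sketch below shows one way this could be done with the torchdiffeq library and a fixed-step Euler solver; model_fn, the step count, and the solver choice are illustrative assumptions, not necessarily how voicebox-pytorch performs sampling.

import torch
from torchdiffeq import odeint

@torch.no_grad()
def sample_with_ode(model_fn, shape, steps=16, device='cpu'):
    # start from pure noise and integrate the learned vector field from t=0 to t=1
    x0 = torch.randn(shape, device=device)

    def ode_fn(t, x):
        # model_fn predicts the velocity field given the current state and a per-sample time
        times_for_batch = t * torch.ones(x.shape[0], device=x.device)
        return model_fn(x, times_for_batch)

    times = torch.linspace(0., 1., steps, device=device)
    trajectory = odeint(ode_fn, x0, times, method='euler')
    return trajectory[-1]  # the final state approximates a sample from the data distribution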
Acknowledgements and Citations
The Voicebox project acknowledges the authors and researchers whose work and collaboration have shaped this TTS framework. The accompanying academic citations provide further background on the technologies and methods integrated into Voicebox.
In summary, Voicebox is a significant step toward more efficient, high-quality text-to-speech generation. With ongoing support and contributions from the community, it promises further advances in generating natural, human-like speech from text.