Introducing MetaVoice-1B
MetaVoice-1B is a 1.2-billion-parameter text-to-speech (TTS) model built for emotional speech rhythm and tone in English. It stands out for zero-shot cloning of American and British voices from as little as 30 seconds of reference audio, and it supports cross-lingual voice cloning via fine-tuning, which has been shown to work with as little as one minute of training data for Indian speakers. The model can synthesize text of arbitrary length and is released under the Apache 2.0 license, so it can be used without restriction.
Quickstart Guide
MetaVoice-1B is straightforward to get started with: the repository ships both a web UI and a server, each launched with a single docker-compose command. Bringing up the server also gives you access to its API definitions.
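Once the server is running, you can call it over HTTP. Below is a minimal sketch; the port, route, payload fields, and response format here are illustrative assumptions, so consult the served API definitions for the actual values:

```python
import requests

# Hypothetical endpoint and payload shape; check the server's API
# definitions for the real route, port, and parameter names.
SERVER_URL = "http://localhost:58003/tts"

payload = {
    "text": "MetaVoice-1B turns text into expressive speech.",
    "speaker_ref_path": "https://example.com/reference_voice.wav",  # ~30 s reference clip
}

response = requests.post(SERVER_URL, json=payload, timeout=120)
response.raise_for_status()

# Assume the server returns the synthesized audio as a WAV payload.
with open("output.wav", "wb") as f:
    f.write(response.content)
```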
Installation Requirements
To install and run MetaVoice-1B you need a GPU with at least 12GB of VRAM, Python >= 3.10 and < 3.12, and pipx (used to install Poetry). Environment setup also requires ffmpeg and Rust if they are not already present on the machine.
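A quick way to sanity-check these prerequisites is a short script like the one below (a convenience sketch, not part of the repository; the thresholds come from the requirements above, and the GPU check assumes PyTorch is already installed):

```python
import shutil
import sys

import torch  # assumes PyTorch is installed, for the GPU check

# Python must be >= 3.10 and < 3.12.
assert (3, 10) <= sys.version_info[:2] < (3, 12), (
    f"Unsupported Python {sys.version_info[:2]}; need >=3.10,<3.12"
)

# The model needs a GPU with at least 12 GB of VRAM.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
assert vram_gb >= 12, f"Only {vram_gb:.1f} GB VRAM available; need >= 12 GB"

# ffmpeg, the Rust toolchain, and pipx must be on PATH.
for tool in ("ffmpeg", "rustc", "pipx"):
    assert shutil.which(tool), f"{tool} not found on PATH"

print("All prerequisites satisfied.")
```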
Project Dependencies Installation
MetaVoice recommends Poetry for installing project dependencies because of how reliably it manages Python environments. pip or conda can be used instead, but issues are debugged against Poetry setups first, so Poetry is strongly encouraged. Installation guides are available for both routes and cover the dependencies needed for model inference and fine-tuning.
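A minimal sketch of automating the Poetry route, run from the root of the cloned repository (this simply strings together the standard pipx and Poetry commands; the repository's own guide is the authoritative sequence):

```python
import shutil
import subprocess

# Install Poetry via pipx if it is not already on PATH.
if shutil.which("poetry") is None:
    subprocess.run(["pipx", "install", "poetry"], check=True)

# Resolve and install the project's dependencies from pyproject.toml.
subprocess.run(["poetry", "install"], check=True)
```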
How to Use MetaVoice-1B
Users can integrate and deploy MetaVoice-1B in various ways:
- Local Usage: Accessible through a reference implementation script, enabling direct use on users' hardware (see the sketch after this list).
- Cloud Deployment: Deployable on cloud platforms like AWS, GCP, or Azure, using its server or web UI options.
- Integration: Available on platforms like Hugging Face and Google Colab, providing versatile integration options for different environments.
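For local usage, the reference implementation centres on a TTS class. Below is a minimal sketch, assuming the module path, class, and method names match the repository's fam/llm/fast_inference.py; verify against your checkout, as these may differ across versions:

```python
from fam.llm.fast_inference import TTS

# Downloads the model weights and sets up the pipeline on first use.
tts = TTS()

# Synthesise speech in the voice of a ~30-second reference clip;
# returns the path to the generated WAV file.
wav_file = tts.synthesise(
    text="This is a demonstration of zero-shot voice cloning.",
    spk_ref_path="assets/bria.mp3",  # illustrative path to a reference clip
)
print(f"Audio written to {wav_file}")
```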
Fine-Tuning Capability
MetaVoice-1B supports fine-tuning of its first-stage large language model (LLM). Fine-tuning expects a dataset in a specified format and can be tried out with the provided sample datasets. Users can adjust hyperparameters, such as the learning rate and other configuration settings, to tailor training, and integration with W&B (Weights & Biases) is supported for monitoring and evaluation.
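As an illustration of preparing a dataset in the expected shape, the sketch below writes a pipe-delimited CSV pairing audio files with their transcripts. The `audio_files`/`captions` column names and `|` delimiter are assumptions here; check the fine-tuning guide and the sample datasets for the exact format:

```python
import csv

# Hypothetical (clip, transcript) pairs; replace with your own data.
rows = [
    ("data/clips/utt_0001.wav", "Hello, and welcome to the demo."),
    ("data/clips/utt_0002.wav", "Fine-tuning needs only minutes of audio."),
]

# Assumed format: pipe-delimited CSV with audio_files/captions columns.
with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["audio_files", "captions"])
    writer.writerows(rows)
```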
Project Architecture
The architecture of MetaVoice-1B predicts EnCodec tokens from text and speaker information and then diffuses them up to the waveform level. A causal GPT, fed text through a custom-trained BPE tokenizer, predicts the first two hierarchies of EnCodec tokens, and this design yields impressive zero-shot generalization. Multi-band diffusion then generates waveforms from the EnCodec tokens, and because diffusion can introduce background artifacts, a post-processing step cleans up the audio.
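To make the token-hierarchy idea concrete, here is a small sketch of flattening two EnCodec codebook hierarchies into one interleaved sequence of the kind a causal transformer can model autoregressively. This is a schematic illustration of the concept, not the repository's actual implementation:

```python
from typing import List

def flatten_interleave(hierarchy_1: List[int], hierarchy_2: List[int]) -> List[int]:
    """Interleave two parallel EnCodec token streams into one flat sequence.

    Each timestep contributes its token from hierarchy 1 followed by its
    token from hierarchy 2, so a causal model sees [h1_t0, h2_t0, h1_t1, ...].
    """
    assert len(hierarchy_1) == len(hierarchy_2), "hierarchies are timestep-aligned"
    flat = []
    for t1, t2 in zip(hierarchy_1, hierarchy_2):
        flat.extend((t1, t2))
    return flat

# Toy token ids for a 4-timestep clip.
print(flatten_interleave([11, 12, 13, 14], [21, 22, 23, 24]))
# -> [11, 21, 12, 22, 13, 23, 14, 24]
```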
Upcoming Features
MetaVoice continues to evolve, with faster inference and improved arbitrary-length text generation on the roadmap. The community can look forward to these enhancements making MetaVoice-1B even more powerful and versatile in TTS applications.
Overall, MetaVoice-1B is a state-of-the-art, open-access TTS model, suitable for developers and researchers looking to explore or implement advanced speech synthesis solutions.