Introducing the MARS5-TTS Project: A New Frontier in Text-to-Speech Technology
The MARS5-TTS project, developed by CAMB.AI, introduces an innovative approach to text-to-speech (TTS) technology. Its primary goal is to deliver natural and dynamic speech generation capabilities, even in challenging prosodic scenarios such as sports commentary or anime dialogue. Below, we will explore the key features, model architecture, and practical usage guidelines for MARS5-TTS.
Approach and Architecture
At the heart of MARS5 is a two-stage pipeline featuring a unique combination of Autoregressive (AR) and Non-Autoregressive (NAR) models. In the first stage, an autoregressive transformer processes the given text and reference audio to encode coarse speech features. These features are then refined in the second stage, which employs a multinomial Denoising Diffusion Probabilistic Model (DDPM) to add detail and produce the final speech output. Notably, by operating on raw audio alongside byte-pair-encoded text, MARS5 lets users guide speech prosody naturally through punctuation and capitalization.
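The two-stage flow described above can be sketched with placeholder stubs; `ar_model` and `ddpm_refine` below are illustrative stand-ins for the AR transformer and the multinomial DDPM, not the real MARS5 APIs.

```python
def ar_model(text, ref_audio):
    # Stage 1 (stub): the real AR transformer encodes text plus reference
    # audio into a coarse sequence of speech codes; here we simply emit
    # one toy token per word of the input text.
    return [len(word) % 8 for word in text.split()]

def ddpm_refine(coarse_codes, steps=4):
    # Stage 2 (stub): the real multinomial DDPM iteratively refines the
    # coarse codes into fine speech features; here we just smooth the toy
    # tokens over a few "denoising" steps to show the refinement loop.
    refined = [float(c) for c in coarse_codes]
    for _ in range(steps):
        refined = [0.5 * (a + b) for a, b in zip(refined, refined[1:] + refined[-1:])]
    return refined

coarse = ar_model("Hello there, WORLD!", [0.0] * 24000)
fine = ddpm_refine(coarse)
```

The point of the sketch is the data flow: text and reference audio enter the AR stage, and its coarse output is what the diffusion stage consumes.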
Key Features
- Voice Cloning: MARS5 captures a speaker's voice characteristics from a reference audio clip lasting between 2 and 12 seconds, with optimal results at around 6 seconds.
- Natural Prosody Control: Users can manipulate prosody—speech rhythm and intonation—by incorporating punctuation and capitalization within the text input. For example, adding a comma can introduce a pause, while capitalizing a word emphasizes it.
- Deep Cloning: By employing the transcript of the reference audio, users can perform a "deep clone," enhancing the quality of voice cloning at the cost of slightly longer processing times.
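As an illustration of the 2-to-12-second window above, a small pre-check could validate and clamp a reference clip before cloning; `prepare_reference` is a hypothetical helper written for this post, not part of the MARS5 API.

```python
import numpy as np

def prepare_reference(wav: np.ndarray, sr: int) -> np.ndarray:
    """Clamp a reference waveform to the 2-12 s window MARS5 expects,
    trimming long clips toward the ~6 s sweet spot."""
    duration = len(wav) / sr
    if duration < 2.0:
        raise ValueError(f"reference is {duration:.1f}s; at least 2s is needed")
    if duration > 12.0:
        # Keep the first 6 seconds, the reported optimum.
        wav = wav[: 6 * sr]
    return wav

# A 20-second clip at 24 kHz gets trimmed to 6 seconds.
clip = prepare_reference(np.zeros(20 * 24000, dtype=np.float32), 24000)
```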
Getting Started with MARS5
MARS5-TTS can be easily installed and used, thanks to its integration with torch.hub. Below are the simple steps to begin synthesizing speech with MARS5:
1. Installation: Ensure you have Python 3.10 and the required libraries installed, such as Torch, Torchaudio, and Librosa. Use pip to install them:

   ```bash
   pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
   ```
2. Load the Models: Download the MARS5 models from torch.hub:

   ```python
   import torch
   import librosa

   mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
   ```
3. Select Reference Audio and Transcript: Load your chosen reference audio and, if using a deep clone, its transcript.

   ```python
   wav, sr = librosa.load('<path to 24kHz waveform>.wav', sr=mars5.sr, mono=True)
   wav = torch.from_numpy(wav)
   ref_transcript = "<reference audio transcript>"
   ```
4. Perform Synthesis: Choose between the shallow and deep cloning methods and run the synthesis.

   ```python
   deep_clone = True
   cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100,
                      top_k=100, temperature=0.7, freq_penalty=3)
   ar_codes, output_audio = mars5.tts("The quick brown rat.", wav, ref_transcript, cfg=cfg)
   ```
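After synthesis, you will typically want to write the waveform to disk. The snippet below is a minimal sketch using only the standard library's `wave` module; it assumes `output_audio` is a 1-D float waveform at 24 kHz (call `.cpu().numpy()` first if it is a torch tensor) and uses a zero-filled stand-in so it runs without the model.

```python
import wave
import numpy as np

sr = 24000                                     # MARS5 operates on 24 kHz audio
output_audio = np.zeros(sr, dtype=np.float32)  # stand-in for the synthesized waveform

# Convert the float waveform in [-1, 1] to 16-bit PCM samples.
pcm = (np.clip(output_audio, -1.0, 1.0) * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 16-bit samples
    f.setframerate(sr)
    f.writeframes(pcm.tobytes())
```

Libraries such as torchaudio or soundfile offer one-line equivalents, but the `wave` version keeps the example dependency-free.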
Model Checkpoints and Additional Information
The MARS5 project offers access to various model checkpoints via its GitHub repository. These cover both AR and NAR models, with details such as parameter sizes and configurations included. The project is open-source, allowing users to build upon it using Docker images as a base.
MARS5-TTS is an exciting leap forward in the field of TTS, offering robust, customizable speech synthesis. Whether you are working on a high-energy sports broadcast or intricate anime dialogue, MARS5 has the capabilities to bring your project to life with dynamic, relatable voiceovers. For a deeper dive into the technical architecture and performance specifics, visit the project's official documentation online.