Introduction to Parler-TTS
Parler-TTS is a lightweight text-to-speech (TTS) model that generates high-quality, natural-sounding speech matching the characteristics described for a speaker, such as gender, pitch, and speaking style. It reproduces and extends the research paper "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" by Dan Lyth and Simon King, of Stability AI and the University of Edinburgh, respectively.
A notable aspect of Parler-TTS is its fully open-source nature. Every part of the project, including datasets, pre-processing methods, training code, and model weights, is publicly available. This transparency encourages community collaboration, enabling anyone to build upon the work and develop their own TTS models.
Key Features
Open-Source Release: Parler-TTS stands out for its fully open release, allowing developers and researchers to inspect the project and integrate its functionality into their own work under a permissive license.
Diverse Model Checkpoints: Two checkpoints have been released: Parler-TTS Mini (880 million parameters) and Parler-TTS Large (2.3 billion parameters), both trained on an extensive corpus of 45,000 hours of audiobook data. These models deliver improved quality and are optimized for faster speech generation.
Flexible Usage: The model can be easily utilized to generate speech with varying voice characteristics by providing a descriptive text prompt. This allows users to select different speakers and customize speech parameters to suit various applications.
Getting Started
Installation
Parler-TTS can be installed with a single command. Run the following in your terminal:
pip install git+https://github.com/huggingface/parler-tts.git
For users on Apple Silicon, an additional step installs a nightly PyTorch build for full support:
pip3 install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
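Once PyTorch is installed, it can be useful to check which accelerator is available before loading the model. The snippet below is a small sketch, not part of the official instructions; the "mps" branch is an assumption relevant to Apple Silicon users:

```python
import torch

# Pick the best available device: NVIDIA GPU, Apple Silicon GPU (MPS), or CPU.
if torch.cuda.is_available():
    device = "cuda:0"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Running Parler-TTS on: {device}")
```

The device string chosen here can be passed to `.to(device)` when loading the model, as in the examples below.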
Running Parler-TTS
Random Voice Generation: To generate speech in a random voice, Parler-TTS needs two inputs: the text to be spoken (the prompt) and a natural-language description of the desired speech qualities.
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Use a GPU if available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

# Tokenize the voice description and the text to be spoken
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the audio and write it out as a WAV file at the model's sampling rate
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
Using Specific Speakers: For a consistent voice across generations, the model provides 34 predefined speaker profiles, including Laura, Mike, and Emily. To select one, include the speaker's name in the description, adding further customization to the voice output.
Optimizing Inference Speed
The Parler-TTS team provides an inference guide for accelerating speech generation, covering techniques such as SDPA attention, torch.compile, and streaming, which together enable substantially faster text-to-speech conversion.
Training and Development
Parler-TTS provides a complete framework for training a TTS model from scratch or fine-tuning one for specific requirements. All training guidelines and resources are housed in the project's training folder.
Acknowledgements
Parler-TTS owes much to open-source communities and the technologies it builds on. Special thanks go to Dan Lyth and Simon King, to libraries including Hugging Face's datasets and transformers, and to tools such as jiwer and wandb. Hugging Face's support with computational resources has also been invaluable.
Contributions and Future Directions
The Parler-TTS project invites public contribution, seeking enhancements in data diversity, model training, multilingual support, and optimization techniques. Continuous development is set to improve both the quality and speed of TTS tasks, offering exciting potential for future advancements in speech synthesis technology.