Multi-Tacotron Voice Cloning
Multi-Tacotron Voice Cloning is a phonemic multilingual (Russian-English) implementation based on the original Real-Time-Voice-Cloning project. It implements a four-stage deep learning framework that derives a numerical representation of a voice from just a few seconds of audio and uses that representation to condition a text-to-speech model. If English-only synthesis is sufficient for your needs, the original Real-Time-Voice-Cloning implementation is recommended instead.
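To make the stages concrete, here is a minimal sketch of the inference pipeline, modeled on the module layout of the parent Real-Time-Voice-Cloning repository; the checkpoint paths are placeholder assumptions, and the exact API of this fork (including where the grapheme-to-phoneme step is applied) may differ, so treat this as an illustration rather than the project's exact interface.

```python
# Minimal pipeline sketch, following the parent Real-Time-Voice-Cloning layout.
# Checkpoint paths are placeholder assumptions.
from encoder import inference as encoder          # speaker encoder (GE2E)
from synthesizer.inference import Synthesizer     # Tacotron synthesizer
from vocoder import inference as vocoder          # WaveRNN vocoder

encoder.load_model("encoder/saved_models/pretrained.pt")
synthesizer = Synthesizer("synthesizer/saved_models/pretrained/")
vocoder.load_model("vocoder/saved_models/pretrained.pt")

# A few seconds of reference audio -> fixed-size speaker embedding
wav = encoder.preprocess_wav("reference.wav")
embedding = encoder.embed_utterance(wav)

# Text (phonemized by the g2p stage in this fork) + embedding -> mel spectrogram
specs = synthesizer.synthesize_spectrograms(["Hello world"], [embedding])

# Mel spectrogram -> waveform
generated_wav = vocoder.infer_waveform(specs[0])
```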
Project Origins and Foundations
The Multi-Tacotron Voice Cloning project is built on the Real-Time-Voice-Cloning effort, which synthesizes a clone of a voice in real time from a short reference recording. By extending that foundation to cover both Russian and English phonetics, the project is well-suited to cross-lingual voice cloning, reproducing the nuances of human speech across the two languages.
Quick Start
To dive into the project quickly, try the colab online demo, which lets you test the voice cloning capabilities directly in the browser on Google Colab.
Requirements
The project requires:
- Python 3.6 or newer
- PyTorch 1.0.1 or later
Essential packages can be installed with: pip install -r requirements.txt
A GPU is required to run the toolbox, but a high-end GPU is only necessary if you intend to retrain the models.
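As a quick sanity check of the environment (a generic snippet, not part of the repository), the following verifies the Python and PyTorch versions and whether a CUDA GPU is visible:

```python
# Generic environment check; not part of the repository.
import sys
import torch

assert sys.version_info >= (3, 6), "Python 3.6+ is required"
print("PyTorch version:", torch.__version__)       # should be 1.0.1 or later
print("CUDA GPU available:", torch.cuda.is_available())
```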
Pretrained Models
For practical involvement without delving deep into technicalities, pre-trained models are conveniently available for download here.
Datasets Used
The project draws on several linguistically diverse datasets:
- Phoneme Dictionaries: for both English and Russian; these drive the grapheme-to-phoneme stage (a minimal lookup sketch follows below).
- LibriSpeech and VoxCeleb: English corpora providing hundreds of hours of speech from thousands of speakers.
- M-AILABS and open_tts/open_stt: Russian corpora providing many hours of speech of varying recording quality.
Together, these datasets give the models a broad training base covering the phonetics of both languages.
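As an illustration of how a phoneme dictionary typically feeds the grapheme-to-phoneme stage, here is a minimal, hypothetical lookup; the one-entry-per-line "WORD PH1 PH2 ..." format is an assumption modeled on CMUdict, not necessarily the project's actual file format.

```python
# Hypothetical dictionary-based g2p lookup; the file format is an assumption.
def load_phoneme_dict(path):
    """Load a CMUdict-style dictionary: one 'WORD PH1 PH2 ...' entry per line."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                table[parts[0].lower()] = parts[1:]
    return table

def g2p(text, table):
    """Map each word to its phonemes, spelling out words missing from the dictionary."""
    phones = []
    for word in text.lower().split():
        phones.extend(table.get(word, list(word)))
    return phones
```

A real implementation also needs stress marks, punctuation handling, and a fallback model for out-of-vocabulary words, but the dictionary lookup above is the core idea.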
Using the Toolbox
For hands-on experimentation, launch the toolbox:
Run: python demo_toolbox.py -d <datasets_root>
or simply: python demo_toolbox.py
where <datasets_root> is the directory containing any downloaded datasets.
Additional Resources and Contribution
Detailed information about using the pretrained models and training models for other languages is available on the project's Wiki. Contributions and inquiries are encouraged, with a contact email provided for communication.
Papers and Research Implemented
The project is grounded in scholarly research, implementing the following papers:
- SV2TTS: "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (arXiv:1806.04558), the overall transfer learning framework for text-to-speech synthesis.
- WaveRNN: "Efficient Neural Audio Synthesis" (arXiv:1802.08435), the vocoder.
- Tacotron 2: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (arXiv:1712.05884), the synthesizer.
- GE2E: "Generalized End-to-End Loss for Speaker Verification" (arXiv:1710.10467), the training objective of the speaker encoder (a minimal sketch of the loss follows this list).
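To make the GE2E objective concrete, here is a minimal sketch of its softmax variant as described by Wan et al. (2018); the scale w and bias b are learnable parameters in the paper (with w constrained positive) but appear as constants here, and this is an illustration of the loss rather than the project's actual training code.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """GE2E softmax loss sketch. embeddings: (n_speakers, n_utterances, dim),
    with n_utterances >= 2. In the paper w and b are learnable; constants here."""
    n_spk, n_utt, _ = embeddings.shape
    embeddings = F.normalize(embeddings, dim=-1)

    # Speaker centroids, plus "exclusive" centroids that leave out the utterance
    # being scored (used for the same-speaker entries, as in the paper).
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)               # (n_spk, dim)
    excl = (embeddings.sum(dim=1, keepdim=True) - embeddings) / (n_utt - 1)

    # Cosine similarity of every utterance to every speaker centroid.
    sim = torch.einsum("sud,kd->suk", embeddings, centroids)              # (n_spk, n_utt, n_spk)
    idx = torch.arange(n_spk)
    sim[idx, :, idx] = F.cosine_similarity(embeddings, excl, dim=-1)      # own-speaker entries
    sim = w * sim + b

    # Softmax variant: each utterance should be classified as its own speaker.
    labels = idx.repeat_interleave(n_utt)
    return F.cross_entropy(sim.reshape(n_spk * n_utt, n_spk), labels)

# Example: a batch of 4 speakers x 5 utterances with 256-dim embeddings
loss = ge2e_softmax_loss(torch.randn(4, 5, 256))
```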
Together, these components combine linguistic diversity with advanced machine learning techniques, keeping Multi-Tacotron Voice Cloning aligned with current voice synthesis research.