Unet-TTS: Enhancing One-Shot Voice Cloning
The One-Shot-Voice-Cloning project is an innovative system aimed at pushing the boundaries of text-to-speech (TTS) technology. It introduces a method designed to tackle the challenge of cloning a voice and transferring its speaking style from unseen data with minimal input.
Key Features
- Inference Code Availability: The project provides ready-to-use inference code along with pre-trained models, allowing users to generate audio from any text input.
- Neutral Emotion Training: The model is trained on a neutral-emotion corpus that contains no strongly emotional speech, so it is most effective in neutral-style scenarios.
- Out-of-Domain Transfer: Transferring styles from data outside the training distribution is difficult for traditional approaches such as speaker embeddings or unsupervised style modeling; this project aims to overcome that limitation.
- Advanced Algorithms: By combining a Unet network with an AdaIN (Adaptive Instance Normalization) layer, the proposed algorithm robustly transfers both speaker identity and speaking style; a minimal sketch of the AdaIN operation follows this list.
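To make the core operation concrete, here is a minimal AdaIN layer in TensorFlow: content features are normalized to zero mean and unit variance along the time axis, then rescaled with the statistics of the reference style features. This is an illustrative sketch under assumed (batch, time, channels) tensor shapes, not the project's actual implementation.

import tensorflow as tf

class AdaIN(tf.keras.layers.Layer):
    """Re-normalizes content features to match the style features' statistics."""
    def __init__(self, epsilon=1e-5):
        super().__init__()
        self.epsilon = epsilon

    def call(self, content, style):
        # Per-channel mean/variance over the time axis of (batch, time, channels) tensors
        c_mean, c_var = tf.nn.moments(content, axes=[1], keepdims=True)
        s_mean, s_var = tf.nn.moments(style, axes=[1], keepdims=True)
        normalized = (content - c_mean) * tf.math.rsqrt(c_var + self.epsilon)
        return normalized * tf.sqrt(s_var + self.epsilon) + s_mean

Intuitively, AdaIN re-imposes the reference speaker's feature statistics on the content representation, which is one common way such style transfer is realized.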
Usage and Installation
System Requirements: The project is designed to run on Linux only. Users must install versions of TensorFlow and tensorflow-addons that match their CUDA version; the recommended combination is TensorFlow 2.6 with tensorflow-addons 0.14.0.
Installation is straightforward:
cd One-Shot-Voice-Cloning/TensorFlowTTS
pip install .
(or python setup.py install)
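After installing, a quick check in Python confirms that the recommended versions are in place:

# Verify the installed framework versions
import tensorflow as tf
import tensorflow_addons as tfa

print(tf.__version__)   # expected: 2.6.x
print(tfa.__version__)  # expected: 0.14.0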
Running the Model: There are two main ways to use the system:
- Direct Script Execution: Users can modify UnetTTS_syn.py to specify their reference audio file; running the script initiates the voice cloning process:
cd One-Shot-Voice-Cloning
CUDA_VISIBLE_DEVICES=0 python UnetTTS_syn.py
- Notebook Integration: Alternatively, users can run the cloning process inside a notebook by appending the repository to the system path and importing the required classes:
import sys
sys.path.append("<your repository's parent directory>/One-Shot-Voice-Cloning")
from UnetTTS_syn import UnetTTS
from tensorflow_tts.audio_process import preprocess_wav
The notebook method offers flexibility, letting users change the input text and reference speech on the fly; a sketch of such a workflow is shown below.
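The following is a hedged end-to-end sketch of that notebook workflow. The UnetTTS constructor arguments, the one_shot_TTS method name and return values, the output sample rate, and the soundfile dependency are assumptions for illustration; consult UnetTTS_syn.py for the actual interface.

import sys
sys.path.append("<your repository's parent directory>/One-Shot-Voice-Cloning")

import soundfile as sf
from UnetTTS_syn import UnetTTS
from tensorflow_tts.audio_process import preprocess_wav

# Trim silence and normalize the single reference recording
ref_audio = preprocess_wav("reference.wav")

# Constructor arguments omitted; see UnetTTS_syn.py for the models,
# text-to-id mapper, and feature configs the class expects.
tts = UnetTTS(...)

# Method name and return values are assumed for illustration
syn_audio, *_ = tts.one_shot_TTS("Text to synthesize.", ref_audio)
sf.write("cloned.wav", syn_audio, 16000)  # sample rate assumed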
Exciting Capabilities
- One-Shot Voice Cloning: With only a single sample of the reference voice, the system can clone the voice and synthesize new speech.
- Automatic Duration Statistics: Using a Style Encoder, the system automatically estimates the duration statistics of the reference speech (a conceptual sketch follows this list).
- Future Enhancements: Upcoming features include multi-speaker TTS capabilities with speaker embeddings and advancements in Unet-TTS training and C++ inference support.
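To give a feel for what matching duration statistics means, the conceptual sketch below rescales predicted phoneme durations so their mean and spread match those measured from reference speech. The function is purely illustrative; the project's Style Encoder derives these statistics from the reference audio itself.

import numpy as np

def rescale_durations(pred, ref_mean, ref_std, eps=1e-8):
    """Match predicted phoneme durations (in frames) to a reference
    speaker's duration statistics via mean/std normalization."""
    mu, sigma = pred.mean(), pred.std() + eps
    rescaled = (pred - mu) / sigma * ref_std + ref_mean
    # Durations must stay positive integer frame counts
    return np.maximum(np.rint(rescaled), 1).astype(int)

# Example: adapt slower predictions to a faster-speaking reference
print(rescale_durations(np.array([8, 12, 10, 15, 9]), ref_mean=7.0, ref_std=2.0))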
Resources and References
For more insights and technical information, users can explore the demo results, access the related research paper, or run tests through a Colab notebook, which provides an interactive platform to understand the system's capabilities. The project also draws upon resources such as TensorFlowTTS and Real-Time-Voice-Cloning repositories.
In summary, One-Shot-Voice-Cloning marks a significant step forward in the TTS domain, providing toolkits and methodologies that simplify voice cloning and style transfer for unseen speakers.