Variational Inference with Adversarial Learning for End-to-End Singing Voice Conversion
Introduction
The whisper-vits-svc project applies deep learning to singing voice conversion (SVC). Singers, musicians, and developers interested in audio processing can use it to convert a singing voice from one singer to another while retaining the original musical emotion and style. This introduction covers the project's essentials, its objectives, and the technical features that make it notable.
Project Objectives
The primary goal of the project is to bridge the gap between theoretical learning and practical application for beginners in deep learning, specifically focusing on Python and PyTorch frameworks. By integrating hands-on experiences with the fundamental concepts of deep learning, this project aims to make the learning process engaging and valuable.
The project does not support real-time voice conversion currently and doesn't plan to package its solution for immediate use in other applications.
System Requirements
Training the models effectively requires a minimum of 6 GB of VRAM. The system supports multiple speakers and allows unique speaker profiles to be created by mixing characteristics. It can handle voice conversion even with light musical accompaniment, and the pitch (F0) curve can be edited in a spreadsheet such as Excel.
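Spreadsheet-based pitch editing can be pictured as a simple round trip: export the per-frame F0 contour to a CSV file, edit the cells, then read the values back. The sketch below is illustrative only; the function names and CSV layout are assumptions, not the project's actual format.

```python
import csv

def export_f0(f0_hz, path):
    """Write a per-frame F0 contour (Hz) to CSV for editing in a spreadsheet.

    Hypothetical helper: one row per frame, 0.0 marking unvoiced frames.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame", "f0_hz"])
        for i, v in enumerate(f0_hz):
            writer.writerow([i, f"{v:.2f}"])

def import_f0(path):
    """Read an edited F0 contour back from CSV as a list of floats."""
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [float(row[1]) for row in reader]
```

After `export_f0`, the file opens directly in Excel or any spreadsheet; `import_f0` reads the edited curve back for synthesis.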
Model Features
The whisper-vits-svc project combines several powerful technologies:
- Whisper from OpenAI, known for its strong noise immunity.
- BigVGAN from NVIDIA, which improves audio quality by clarifying formants and sound.
- NaturalSpeech from Microsoft, which helps reduce mispronunciation errors.
- Neural Source-Filter and pitch quantization, which together address F0 discontinuity and F0 embedding.
- Speaker Encoder from Google, which provides timbre encoding and clustering.
- HiFTNet from Columbia University, which accelerates inference.
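Pitch quantization, mentioned above, typically maps a continuous F0 value in Hz onto a small set of integer bins on a log scale so it can be fed to an embedding layer. The following is a minimal sketch of that idea; the bin count and F0 range are assumptions, not the project's actual configuration.

```python
import math

# Assumed constants: F0 search range and embedding table size.
F0_MIN, F0_MAX, N_BINS = 50.0, 1100.0, 256

def quantize_f0(f0_hz):
    """Map an F0 value in Hz to an integer bin on a log scale.

    Bin 0 is reserved for unvoiced frames; voiced frames use bins 1..255.
    """
    if f0_hz <= 0:
        return 0
    log_min, log_max = math.log(F0_MIN), math.log(F0_MAX)
    clamped = max(F0_MIN, min(f0_hz, F0_MAX))
    x = (math.log(clamped) - log_min) / (log_max - log_min)
    return 1 + int(x * (N_BINS - 2))
```

Quantizing on a log scale matches musical perception: each semitone occupies roughly the same number of bins regardless of register.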
Project Highlights
- Supports the mixing of speaker profiles, allowing for creative audio creations.
- Capable of robust pitch editing using spreadsheets, making it accessible for users without advanced audio software knowledge.
- Utilizes data perturbation techniques which, while lengthening the training period, enhance conversion stability and sound quality.
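Speaker-profile mixing, the first highlight above, can be understood as a weighted average of speaker embeddings. The sketch below illustrates the idea on plain Python lists; the function name and the use of a simple linear blend are assumptions, not the project's exact method.

```python
def mix_speakers(embeddings, weights):
    """Blend speaker embeddings with a weighted average to form a new profile.

    embeddings: list of equal-length vectors, one per source speaker.
    weights: non-negative mixing weights (normalized internally).
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(norm, embeddings)) for i in range(dim)]
```

For example, mixing two speakers at a 3:1 ratio yields an embedding three-quarters of the way toward the first voice.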
Setup Instructions
- Install PyTorch: follow the installation guide on the official website.
- Install dependencies: run `pip install -r requirements.txt` from the project root.
- Download models: a set of pre-trained models and encoders must be downloaded and placed in the predefined directories.
Training and Inference
The training process begins with a clearly labelled dataset, organized into separate folders per speaker. The project provides a detailed preprocessing pipeline that ensures audio quality and compatibility for conversion tasks. Training can be resumed flexibly, and progress can be monitored with TensorBoard.
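The per-speaker folder convention can be scanned into a speaker-to-files mapping before preprocessing. This is an illustrative sketch, assuming a layout of `root/<speaker_name>/<clip>.wav`; the function name is hypothetical and the project's own scripts may organize this differently.

```python
from pathlib import Path

def scan_dataset(root):
    """Build a speaker -> audio-file mapping from a dataset laid out as
    root/<speaker_name>/<clip>.wav (assumed layout)."""
    mapping = {}
    for spk_dir in sorted(Path(root).iterdir()):
        if spk_dir.is_dir():
            mapping[spk_dir.name] = sorted(str(p) for p in spk_dir.glob("*.wav"))
    return mapping
```

A quick scan like this also makes labelling errors (empty folders, stray files) easy to spot before a long training run.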
For inference, users convert audio files by specifying parameters such as the pitch shift, content vectors, and manual F0 adjustments, allowing substantial customization of the output.
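The pitch-shift parameter is conventionally expressed in semitones, where each semitone multiplies the frequency by 2^(1/12). A minimal sketch of applying such a shift to an F0 contour (the function name is illustrative, not the project's API):

```python
def shift_f0(f0_hz, semitones):
    """Transpose a per-frame F0 contour by a number of semitones.

    Unvoiced frames (0.0) are left untouched.
    """
    factor = 2.0 ** (semitones / 12.0)
    return [v * factor if v > 0 else 0.0 for v in f0_hz]
```

Shifting by +12 semitones doubles every voiced frequency, i.e. raises the melody by exactly one octave.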
Advanced Features
The model supports feature retrieval to stabilize timbre: a retrieval index built from training features lets the model take full advantage of pre-trained weights and further improves sound quality. The "Create Singer" feature, named EVE, lets users combine attributes from multiple speaker profiles to create a new, entirely synthesized voice.
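Retrieval-based timbre stabilization is commonly implemented by replacing (or blending) each content frame with its nearest neighbor from an index of training-set features. The sketch below shows that idea with a brute-force search on plain lists; the function name, the blend ratio, and the exact distance metric are assumptions, not the project's implementation.

```python
def retrieve_features(frames, index, ratio=0.5):
    """Blend each content frame with its nearest neighbor from a
    training-feature index (hypothetical helper).

    frames, index: lists of equal-length vectors.
    ratio: weight given to the retrieved feature (1.0 = replace outright).
    """
    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out = []
    for f in frames:
        nearest = min(index, key=lambda v: dist2(f, v))
        out.append([ratio * n + (1 - ratio) * x for n, x in zip(nearest, f)])
    return out
```

Pulling each frame toward features actually seen in training keeps the converted voice inside the target speaker's timbre space, at the cost of some fidelity to the source.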
Available Datasets
The project supports an array of open-source datasets crucial for training models effectively. Example datasets include:
- KiSing
- OpenCpop
- Multi-Singer
- VCTK
Conclusion
The whisper-vits-svc project is a comprehensive toolkit for enthusiasts and researchers interested in the domain of singing voice conversion. It effectively combines sophisticated AI models and practical utilities to deliver an engaging learning experience and groundbreaking audio manipulation capabilities. By employing this system, users can explore the fascinating intersection of music and artificial intelligence, creating a platform for innovation in the field of sound and voice processing.