SoftVC VITS Singing Voice Conversion
Introduction
The so-vits-svc project, or SoftVC VITS Singing Voice Conversion, is an open-source initiative focused on transforming singing voices rather than converting text to speech. Unlike typical Text-to-Speech (TTS) systems, this project specializes in Singing Voice Conversion (SVC), which sets it apart from standard applications of the VITS framework. The core objective of this project is to allow users to make anime characters sing, emphasizing fictional characters over real individuals.
Project Objectives
The original intention behind this project is to enable developers and users to have anime characters sing. The developers consciously focus on fictional characters to avoid any implications involving real personalities.
Project Disclaimer
The project operates entirely offline and does not store or collect user data. Contributors, including the SvcDevelopTeam, maintain that they do not have control over how the project is used. They have not offered assistance to any external entity concerning dataset handling, computations, or model training. Consequently, any AI models or synthesized audio created using this project are beyond the responsibility of the contributors, and users are accountable for any outcomes arising from their use.
Terms of Use
- This project is solely intended for academic and learning purposes.
- Any content generated using this project, when shared on video platforms, must clearly attribute the original source of the input audio.
- Users are strictly responsible for ensuring they are authorized to use any datasets for training; using unauthorized datasets may have legal consequences.
- Engaging in illegal activities or misusing the project for unethical purposes, including religious or political matters, is prohibited.
- Continued use of the project implies acceptance of these terms.
Technical Overview
The project's model leverages the SoftVC content encoder to extract speech features from the source audio. These features are fed directly into the VITS model without any intermediate conversion to text, which preserves the pitch and intonation of the original performance. Additionally, the vocoder has been replaced with NSF-HiFiGAN to solve the problem of interrupted sound.
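To make the data flow concrete, here is a minimal illustrative sketch of that pipeline. The function names, frame sizes, and array shapes below are placeholder assumptions of ours, not the project's actual API.

```python
import numpy as np

# Illustrative stand-ins for the three stages described above; none of these
# are the project's real functions, and the shapes are placeholder guesses.

def softvc_encode(audio: np.ndarray) -> np.ndarray:
    """Content encoder: extract speaker-independent speech features."""
    return np.zeros((len(audio) // 320, 256))

def extract_f0(audio: np.ndarray) -> np.ndarray:
    """Pitch (F0) track, passed alongside the features so intonation survives."""
    return np.zeros(len(audio) // 320)

def vits_nsf_hifigan_decode(features: np.ndarray, f0: np.ndarray, speaker: int) -> np.ndarray:
    """VITS decoder plus NSF-HiFiGAN vocoder: render audio in the target voice."""
    return np.zeros(len(f0) * 320)

source = np.random.randn(16000 * 5)          # 5 seconds of source singing at 16 kHz
features = softvc_encode(source)             # no text transcript involved
f0 = extract_f0(source)
converted = vits_nsf_hifigan_decode(features, f0, speaker=0)
```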
Version 4.1-Stable Updates
- Feature input is now taken from the 12th layer of the ContentVec Transformer output, while remaining compatible with 4.0 models.
- The update introduces shallow diffusion models to enhance sound quality.
- New additions include the Whisper-PPG speech encoder, static/dynamic voice fusion, loudness embedding, and a feature retrieval function adopted from RVC.
Model Compatibility and Configurations
Users wishing to adapt a 4.0 model to the new features must adjust its config.json file accordingly, particularly by adding the "speech_encoder" field, as shown in the sketch below.
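A minimal sketch of that edit, assuming the 4.0 model used the default "vec256l9" encoder and that the field belongs in the model section of config.json (verify both against your own model before applying):

```python
import json

# Load the 4.0 model's config, add the speech_encoder field, and save it back.
# "vec256l9" is the assumed default encoder for 4.0 models; adjust if yours differs.
with open("config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

config["model"]["speech_encoder"] = "vec256l9"

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```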
System Requirements
Stable functionality has been tested on Python version 3.8.9.
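As a precaution, you can warn yourself when running under an untested interpreter; the exact-version check below is just one conservative policy, not a project requirement.

```python
import sys

# Warn when running under a Python version other than the tested 3.8.9.
if sys.version_info[:3] != (3, 8, 9):
    print(f"Tested on Python 3.8.9; you are running {sys.version.split()[0]}.")
```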
Pre-trained Models
Users must select a suitable encoder and download its checkpoint before proceeding with the pre-trained models; a quick placement check is sketched after this list. Options include:
- ContentVec Encoder (Recommended): obtain the "checkpoint_best_legacy_500.pt" file and place it in the pretrain directory.
- Hubert Soft Encoder: obtain the "hubert-soft-0d54a1f4.pt" file and place it in the pretrain directory.
- Whisper-PPG Encoder: download the "medium.pt" model compatible with whisper-ppg and place it in the pretrain directory.
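As mentioned above, a small placement check can catch a missing checkpoint early. This sketch maps each encoder to its expected file; the mapping keys and helper function are our own illustration, not part of the project.

```python
from pathlib import Path

# Expected checkpoint file per encoder (our mapping, based on the list above).
ENCODER_CHECKPOINTS = {
    "contentvec": "checkpoint_best_legacy_500.pt",
    "hubert-soft": "hubert-soft-0d54a1f4.pt",
    "whisper-ppg": "medium.pt",
}

def check_encoder_checkpoint(encoder: str, pretrain_dir: str = "pretrain") -> bool:
    """Return True if the chosen encoder's checkpoint sits in the pretrain directory."""
    ckpt = Path(pretrain_dir) / ENCODER_CHECKPOINTS[encoder]
    if not ckpt.is_file():
        print(f"Missing {ckpt}: download it before training or inference.")
        return False
    return True

check_encoder_checkpoint("contentvec")
```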
This project emphasizes its use strictly for education and development in fictional singing voice transformations, distancing itself from any potential misuse involving real individuals.