Retrieval-based-Voice-Conversion-WebUI - Simplified Voice Conversion Utilizing the VITS Framework

Introduction to the Retrieval-based Voice Conversion WebUI

Retrieval-based Voice Conversion WebUI is an easy-to-use voice conversion framework based on VITS. It is designed to provide a friendly interface for transforming voices with a focus on quality and simplicity. The project offers a streamlined experience for both training and inference, ideal for users interested in developing or experimenting with voice conversion technologies.

Key Features

Sound Source Feature Replacement: The framework uses top-1 retrieval to replace each input source feature with its closest match from the training set, which reduces leakage of the source speaker's timbre into the converted voice.
Fast Training: The system can be trained quickly even on less powerful graphics cards, making it accessible for a wider audience.
Data Efficiency: Good voice conversion results can be achieved with a small amount of data; at least 10 minutes of low-noise voice recordings is recommended.
Timbre Modification via Model Fusion: Users can modify timbre by merging models within the ckpt-merge tab.
Web Interface: A straightforward web interface allows easy access to features and operations.
Vocal and Accompaniment Separation: UVR5 models can be invoked to separate vocals from accompaniments quickly.
Advanced Vocal Pitch Extraction: Uses the RMVPE pitch extraction algorithm (InterSpeech 2023), which is both effective and efficient.
Hardware Acceleration Support: The framework supports acceleration on AMD and Intel graphics cards.
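
To make the top-1 retrieval idea from the feature list concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the function names, the list-of-lists feature layout, and the blend parameter are all hypothetical, and a real system would use a vector index rather than a brute-force search.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top1_replace(
    source: Sequence[Vector],
    train: Sequence[Vector],
    blend: float = 1.0,
) -> Tuple[List[List[float]], List[int]]:
    """Swap each source frame for its nearest training-set frame.

    blend is a hypothetical mixing weight: 1.0 uses the retrieved
    frame only, 0.0 keeps the source frame unchanged.
    """
    replaced: List[List[float]] = []
    indices: List[int] = []
    for frame in source:
        # Brute-force top-1 search over every training frame.
        best = min(range(len(train)),
                   key=lambda i: squared_distance(frame, train[i]))
        indices.append(best)
        replaced.append([blend * t + (1.0 - blend) * s
                         for s, t in zip(frame, train[best])])
    return replaced, indices
```

With blend left at 1.0, the output contains only vectors that exist in the training set, which is the intuition behind why retrieval limits how much of the source speaker's timbre survives.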

Getting Started

Environment Setup

Ensure you are using a Python version greater than 3.8.
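
A quick way to confirm the interpreter version before installing anything is sketched below. The helper name is hypothetical, and the sketch assumes that 3.8 itself is acceptable; if the requirement is read strictly, raise the minimum to (3, 9).

```python
import sys


def python_is_supported(version_info=None, minimum=(3, 8)):
    """Return True if the (major, minor) version is at least `minimum`.

    Assumption: 3.8 itself counts as supported. Pass minimum=(3, 9)
    to require a strictly newer interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit("Python %d.%d is too old" % sys.version_info[:2])
```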

Installation for Various Platforms

Choose one of the following installation methods:

Pip Installation: Install the necessary dependencies with Python's pip tool; suitable for users already familiar with pip.
Poetry Tool: Another option is to use the Poetry dependency management tool, ideal for handling complex dependency trees and versions.

macOS Installation

An additional script, run.sh, is provided for macOS users to streamline the installation process.

Pre-Model Setup

Retrieval-based Voice Conversion requires certain pretrained models to function effectively:

Download Required Assets: These include hubert_base.pt, pretrained models, and uvr5_weights.
FFmpeg Installation: Necessary for audio processing, with platform-specific installation instructions provided.
RMVPE Model File: Required for advanced vocal pitch extraction; place it in the project's root directory.
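
The checklist above can be verified with a short script before launching. This is a sketch under stated assumptions: the helper names are hypothetical, the file name rmvpe.pt is assumed, and the location of each asset relative to the project root is a guess that should be adjusted to match the actual installation instructions.

```python
import shutil
from pathlib import Path

# Assumed names and locations relative to the project root; edit as needed.
REQUIRED_PATHS = ("hubert_base.pt", "uvr5_weights", "rmvpe.pt")


def missing_assets(root, required=REQUIRED_PATHS):
    """Return the required files or folders that do not exist under root."""
    root = Path(root)
    return [name for name in required if not (root / name).exists()]


def ffmpeg_available():
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    for name in missing_assets("."):
        print("missing:", name)
    if not ffmpeg_available():
        print("ffmpeg was not found on PATH")
```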

Special Instructions for AMD and Intel GPUs on Linux

For AMD GPU users intending to use ROCm on Linux or Intel GPU users leveraging IPEX technology, additional environment variables and driver installations might be needed.
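
Since missing environment variables are easy to overlook, a small check such as the following can report which ones are unset. The helper is hypothetical, and the variable names in the usage example (ROCM_PATH and HSA_OVERRIDE_GFX_VERSION) are only examples of ROCm-related settings; which variables are actually required depends on the GPU and driver, so treat the list as an assumption.

```python
import os


def unset_variables(names, environ=None):
    """Return the names that are absent or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Example names only; consult the platform instructions for the real list.
    for name in unset_variables(["ROCM_PATH", "HSA_OVERRIDE_GFX_VERSION"]):
        print("not set:", name)
```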

Usage

Launch the WebUI using:

python infer-web.py

Alternatively, for users who have used Poetry for dependencies:

poetry run python infer-web.py

Demonstration and Further Reading

A demonstration video is available for a visual introduction, and extensive documentation covers common issues, experimental records, and training tutorials for AI singers.

For those interested in the technical foundations, this project builds upon various open-source projects like ContentVec, VITS, HiFiGAN, and others, ensuring robust voice conversion and feature extraction capabilities.

Community and Contributions

The project thrives with a growing community of contributors. For collaboration or further inquiries, users are encouraged to join the Discord community or explore the project's resources on popular platforms like Huggingface.

Explore, experiment, and enjoy the seamless voice conversion experience that Retrieval-based Voice Conversion WebUI offers!