Introduction to Fish Diffusion
Fish Diffusion is a powerful and user-friendly framework designed for tackling various voice generation tasks. With a focus on Text-to-Speech (TTS), Singing Voice Synthesis (SVS), and Singing Voice Conversion (SVC), Fish Diffusion offers an accessible and effective platform for both beginners and experienced developers. This article provides an in-depth overview of the project, its features, and how to get started.
Overview and Features
Fish Diffusion utilizes cutting-edge diffusion models to address diverse voice generation tasks. It builds upon the foundation of the original diffsvc repository with the following enhancements:
- Multi-Speaker Support: Easily handle voice synthesis for multiple speakers with robust support for varied audio data.
- Simplified Code Structure: The code is designed to be straightforward and modular, making it easier to understand and modify.
- High-Quality Audio: Supports the 44.1kHz DiffSinger community vocoder for high-fidelity audio output.
- Efficient Training: Capable of multi-machine, multi-device training with half-precision support, improving training speed and reducing memory usage.
Getting Started
Environment Preparation
To start using Fish Diffusion, set up a conda environment with Python 3.10 and install the necessary dependencies:
# Create and activate a Python 3.10 environment (the environment name is your choice)
conda create -n fish-diffusion python=3.10
conda activate fish-diffusion
# Install PyTorch core dependencies
conda install "pytorch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" pytorch-cuda=11.8 -c pytorch -c nvidia
# Install PDM for dependency management
curl -sSL https://raw.githubusercontent.com/pdm-project/pdm/main/install-pdm.py | python3 -
# Sync project dependencies
pdm sync
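To confirm the environment is set up correctly, a quick sanity check with plain PyTorch (nothing Fish Diffusion specific) should report your GPU:
# Verify that PyTorch is installed and can see the GPU before training.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))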
Vocoder Setup
Fish Diffusion requires the FishAudio NSF-HiFiGAN vocoder for audio generation. You can download it automatically or manually:
- Automatic Download: Use the following script, agreeing to the CC BY-NC-SA 4.0 license if prompted:
python tools/download_nsf_hifigan.py --agree-license
- Manual Download: Get nsf_hifigan-stable-v1.zip, unzip it, and place the nsf_hifigan folder in the checkpoints directory.
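Either way, you can verify that the files landed where Fish Diffusion expects them. A minimal sketch, assuming only the checkpoints/nsf_hifigan path described above:
# Check that the vocoder folder exists and list its contents.
from pathlib import Path

vocoder_dir = Path("checkpoints/nsf_hifigan")
if not vocoder_dir.is_dir():
    raise SystemExit("nsf_hifigan folder not found under checkpoints/")
print("Vocoder files:", sorted(p.name for p in vocoder_dir.iterdir()))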
Dataset Preparation
Organize your audio data in the dataset directory, ensuring a structure that clearly separates training and validation data:
dataset
├───train
│   ├───sample1.wav
│   └───speaker0
│       └───sample2.wav
└───valid
    └───sample3.wav
Then, extract data features using the following command:
python tools/preprocessing/extract_features.py --config configs/svc_hubert_soft.py --path dataset --clean
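Because the bundled vocoder targets 44.1kHz audio, it can help to resample source clips before preprocessing. A minimal sketch using torchaudio; the input and output paths are hypothetical:
# Resample a clip to 44.1kHz before placing it in dataset/train/.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("raw/sample1.wav")  # hypothetical source file
if sr != 44100:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=44100)
torchaudio.save("dataset/train/sample1.wav", waveform, 44100)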
Training and Inference
Training Your Model
Fish Diffusion supports both single-machine and multi-node training:
- Single-Machine Training:
python tools/diffusion/train.py --config configs/svc_hubert_soft.py
- Multi-Node Training: Multi-node setups require additional launcher and trainer configuration, as outlined in the sketch after this list.
- Resuming or Fine-Tuning a Model:
python tools/diffusion/train.py --config configs/svc_hubert_soft.py --resume [checkpoint filepath]
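As a rough illustration of the multi-device options, here is a hedged sketch of a trainer block in one of the project's Python config files. The key names assume the config forwards these arguments to a PyTorch Lightning Trainer; verify them against the configs shipped in the repository before relying on this:
# Illustrative trainer settings (key names assumed, not taken verbatim
# from the official configs -- check configs/ in the repository).
trainer = dict(
    accelerator="gpu",
    devices=-1,            # use every visible GPU on this machine
    num_nodes=2,           # multi-node: total number of participating machines
    strategy="ddp",        # distributed data parallel across devices
    precision="16-mixed",  # the half-precision training mentioned above
)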
Inference
Generate audio outputs using either shell commands or a web interface:
# Shell Inference
python tools/diffusion/inference.py --config [config file] --checkpoint [checkpoint file] --input [input audio] --output [output audio]
# Web Inference with Gradio
python tools/diffusion/inference.py --config [config file] --checkpoint [checkpoint file] --gradio
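For many files, the shell command can be scripted. A minimal batch-inference sketch; the folder layout and checkpoint name are placeholders for illustration:
# Run the inference CLI over every .wav file in a folder.
import subprocess
from pathlib import Path

CONFIG = "configs/svc_hubert_soft.py"
CHECKPOINT = "checkpoints/your-model.ckpt"  # placeholder checkpoint path
Path("outputs").mkdir(exist_ok=True)

for wav in sorted(Path("inputs").glob("*.wav")):
    subprocess.run(
        ["python", "tools/diffusion/inference.py",
         "--config", CONFIG, "--checkpoint", CHECKPOINT,
         "--input", str(wav), "--output", str(Path("outputs") / wav.name)],
        check=True,
    )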
Contributing and Further Resources
Fish Diffusion is an open-source project that welcomes contributions. Before submitting a contribution, run pdm run lint to check your code. You can also generate real-time documentation with pdm run docs.
For more information and to explore the source code, visit the Fish Diffusion GitHub repository, which also provides detailed documentation and additional resources to help you get started with your projects.