Introduction to Fish Diffusion
Fish Diffusion is a powerful and user-friendly framework designed for tackling various voice generation tasks. With a focus on Text-to-Speech (TTS), Singing Voice Synthesis (SVS), and Singing Voice Conversion (SVC), Fish Diffusion offers an accessible and effective platform for both beginners and experienced developers. This article provides an in-depth overview of the project, its features, and how to get started.
Overview and Features
Fish Diffusion utilizes cutting-edge diffusion models to address diverse voice generation tasks. It builds upon the foundation of the original diffsvc repository with the following enhancements:
- Multi-Speaker Support: Easily handle voice synthesis for multiple speakers with robust support for varied audio data.
- Simplified Code Structure: The code is designed to be straightforward and modular, making it easier to understand and modify.
- High-Quality Audio: Supports the 44.1kHz DiffSinger community vocoder for high-fidelity audio output.
- Efficient Training: Capable of multi-machine, multi-device training with half-precision support, improving training speed and reducing memory usage.
Getting Started
Environment Preparation
To start using Fish Diffusion, set up a conda environment with Python 3.10 and install the necessary dependencies:
# Create and activate a Python 3.10 environment (the environment name is your choice)
conda create -n fish-diffusion python=3.10
conda activate fish-diffusion
# Install PyTorch core dependencies
conda install "pytorch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" pytorch-cuda=11.8 -c pytorch -c nvidia
# Install PDM for dependency management
curl -sSL https://raw.githubusercontent.com/pdm-project/pdm/main/install-pdm.py | python3 -
# Sync project dependencies
pdm sync
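To confirm the environment is set up correctly, a quick sanity check with plain PyTorch (nothing Fish Diffusion specific) should report your GPU:
# Verify that PyTorch is installed and can see the GPU before training.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))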
Vocoder Setup
Fish Diffusion requires the FishAudio NSF-HiFiGAN vocoder for audio generation. You can download it automatically or manually:
- Automatic Download: Use the following script, agreeing to the CC BY-NC-SA 4.0 license if prompted:
python tools/download_nsf_hifigan.py --agree-license
- Manual Download: Get nsf_hifigan-stable-v1.zip, unzip it, and place the nsf_hifigan folder in the checkpoints directory.
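Either way, you can verify that the files landed where Fish Diffusion expects them. A minimal sketch, assuming only the checkpoints/nsf_hifigan path described above:
# Check that the vocoder folder exists and list its contents.
from pathlib import Path

vocoder_dir = Path("checkpoints/nsf_hifigan")
if not vocoder_dir.is_dir():
    raise SystemExit("nsf_hifigan folder not found under checkpoints/")
print("Vocoder files:", sorted(p.name for p in vocoder_dir.iterdir()))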
Dataset Preparation
Organize your audio data in the dataset directory, ensuring a structure that clearly separates training and validation data:
dataset
├───train
│   ├───sample1.wav
│   └───speaker0
│       └───sample2.wav
└───valid
    └───sample3.wav
Then, extract data features using the following command:
python tools/preprocessing/extract_features.py --config configs/svc_hubert_soft.py --path dataset --clean
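Because the bundled vocoder targets 44.1kHz audio, it can help to resample source clips before preprocessing. A minimal sketch using torchaudio; the input and output paths are hypothetical:
# Resample a clip to 44.1kHz before placing it in dataset/train/.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("raw/sample1.wav")  # hypothetical source file
if sr != 44100:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=44100)
torchaudio.save("dataset/train/sample1.wav", waveform, 44100)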
Training and Inference
Training Your Model
Fish Diffusion supports both single-machine and multi-node training:
- Single-Machine Training:
python tools/diffusion/train.py --config configs/svc_hubert_soft.py
- Multi-Node Training: Multi-node setups require additional launcher and trainer configuration, as outlined in the sketch after this list.
- Resuming or Fine-Tuning a Model:
python tools/diffusion/train.py --config configs/svc_hubert_soft.py --resume [checkpoint filepath]
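As a rough illustration of the multi-device options, here is a hedged sketch of a trainer block in one of the project's Python config files. The key names assume the config forwards these arguments to a PyTorch Lightning Trainer; verify them against the configs shipped in the repository before relying on this:
# Illustrative trainer settings (key names assumed, not taken verbatim
# from the official configs -- check configs/ in the repository).
trainer = dict(
    accelerator="gpu",
    devices=-1,            # use every visible GPU on this machine
    num_nodes=2,           # multi-node: total number of participating machines
    strategy="ddp",        # distributed data parallel across devices
    precision="16-mixed",  # the half-precision training mentioned above
)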
Inference
Generate audio outputs using either shell commands or a web interface:
# Shell Inference
python tools/diffusion/inference.py --config [config file] --checkpoint [checkpoint file] --input [input audio] --output [output audio]
# Web Inference with Gradio
python tools/diffusion/inference.py --config [config file] --checkpoint [checkpoint file] --gradio
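For many files, the shell command can be scripted. A minimal batch-inference sketch; the folder layout and checkpoint name are placeholders for illustration:
# Run the inference CLI over every .wav file in a folder.
import subprocess
from pathlib import Path

CONFIG = "configs/svc_hubert_soft.py"
CHECKPOINT = "checkpoints/your-model.ckpt"  # placeholder checkpoint path
Path("outputs").mkdir(exist_ok=True)

for wav in sorted(Path("inputs").glob("*.wav")):
    subprocess.run(
        ["python", "tools/diffusion/inference.py",
         "--config", CONFIG, "--checkpoint", CHECKPOINT,
         "--input", str(wav), "--output", str(Path("outputs") / wav.name)],
        check=True,
    )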
Contributing and Further Resources
Fish Diffusion is an open-source project that welcomes contributions. Before submitting a contribution, run pdm run lint to check your code. You can also generate real-time documentation with pdm run docs.
For more information and to explore the source code, visit the Fish Diffusion GitHub repository, which also provides detailed documentation and additional resources to help you get started with your projects.