Introduction to CosyVoice_For_Windows
CosyVoice_For_Windows is an advanced speech synthesis toolkit that provides users with high-quality and versatile text-to-speech (TTS) capabilities. Designed for Windows, this toolkit enables users to convert text into lifelike speech using cutting-edge artificial intelligence models. CosyVoice is ideal for researchers, developers, and enthusiasts aiming to explore the capabilities of synthetic audio or integrate TTS technology into their applications.
Setup Requirements
For optimal performance, install Python 3.11, which this project targets for its performance improvements. If you are working with an NVIDIA GPU, also set up CUDA 12.6 and cuDNN 9.4 to accelerate model inference.
With these prerequisites in place:
- Install the project dependencies:
pip3 install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
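After installing, you can quickly verify that the CUDA-enabled PyTorch build is active and your GPU is visible (a small sanity check, not part of the project itself):
import torch

# Confirm that the CUDA build of PyTorch is installed and a GPU is visible.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))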
Running the Service
CosyVoice can be run as a local service using Python, providing endpoints for text conversion to speech, subtitle generation, and audio output.
- To launch the API service, execute the following:
python3 api.py
- Access the output through:
- API URL:
http://localhost:9880/?text=YourTextHere&speaker=SpeakerName
- Subtitle File:
http://localhost:9880/file/output.srt
- Audio File:
http://localhost:9880/file/output.wav
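A minimal client sketch for the endpoints above, assuming the service returns the synthesized audio directly as WAV bytes; the response format and any parameters beyond text and speaker are assumptions, so check api.py for the authoritative details:
import requests

# Ask the local service to synthesize speech (response assumed to be raw WAV bytes).
params = {"text": "Hello from CosyVoice", "speaker": "SpeakerName"}
resp = requests.get("http://localhost:9880/", params=params)
resp.raise_for_status()

with open("api_output.wav", "wb") as f:
    f.write(resp.content)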
Installation Instructions
Clone the Repository
First, clone the CosyVoice_For_Windows repository and its submodules to your local machine:
git clone --recursive https://github.com/v3ucn/CosyVoice_For_Windows.git
cd CosyVoice_For_Windows
git submodule update --init --recursive
Set up a Conda environment specifically for CosyVoice:
- Install Conda from the Miniconda website.
- Create and activate the Conda environment:
conda create -n cosyvoice python=3.11
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
Model Download
To fully utilize CosyVoice, download the pretrained models:
- Use the ModelScope SDK for downloading:
from modelscope import snapshot_download
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('speech_tts/speech_kantts_ttsfrd', local_dir='pretrained_models/speech_kantts_ttsfrd')
- Alternatively, use Git to download the models, ensuring Git LFS is installed:
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
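If Git LFS has not been initialized on your machine yet, enable it before cloning; the remaining models follow the same pattern (the repository paths below are inferred from the ModelScope identifiers above, so verify them on modelscope.cn):
git lfs install
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct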
Usage Examples
CosyVoice offers several ways to perform speech synthesis:
- Zero-Shot Inference: Clones a voice from a short reference audio prompt without any additional model training.
- SFT Inference: Utilizes fine-tuned models for more tailored speech synthesis.
- Cross-Lingual Inference: Capable of synthesizing speech across different languages.
- Instruct Inference: Customizes speech output with specific instructions in dialogue.
All these use cases can be explored by driving CosyVoice from Python scripts, leveraging the torchaudio library to save the output audio, as in the sketch below.
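A minimal SFT-mode sketch, modeled on the upstream CosyVoice examples; the import path, speaker name, and 22050 Hz sample rate are assumptions carried over from upstream and may differ in this fork:
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice  # import path assumed from upstream CosyVoice

# Load a fine-tuned (SFT) model downloaded earlier.
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')

# List the built-in speakers, then synthesize a short sentence with one of them.
print(cosyvoice.list_avaliable_spks())
output = cosyvoice.inference_sft('Hello, this is a CosyVoice test.', '中文女')

# Save the generated waveform to disk.
torchaudio.save('sft_output.wav', output['tts_speech'], 22050)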
Web and Advanced Usage
CosyVoice includes a web interface for easy access to its functionalities. It supports all of the inference modes described above: SFT, zero-shot, cross-lingual, and instruct. To start the web UI:
python3 webui.py --port 9886 --model_dir ./pretrained_models/CosyVoice-300M
For those interested in customizing or deploying CosyVoice as a service, advanced training and inference scripts are available. Additionally, gRPC can optionally be used to deploy models as scalable services.
Support and Acknowledgments
CosyVoice’s development builds on contributions from several open-source projects, including FunASR and FunCodec, giving users a robust toolkit for exploring speech synthesis.
Join discussions or seek assistance through the GitHub Issues page or the official chat group for community support.