# Speech Synthesis

## VALL-E-X
VALL-E X is a multilingual TTS model capable of zero-shot voice cloning, accent adaptation, and emotion synthesis, with cross-lingual speech generation in English, Chinese, and Japanese. This open-source implementation of Microsoft's model delivers improved audio quality and emotion control, and runs on both CPU and GPU with minimal VRAM. Online demos are available through Hugging Face and Google Colab, along with complete installation and usage instructions.
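Zero-shot cloning in VALL-E-style models works by conditioning an autoregressive decoder on the phonemes of both the prompt and the target text, followed by the prompt's acoustic codec tokens; the model then continues the acoustic token sequence in the prompt speaker's voice. The sketch below illustrates only that conditioning layout; the token IDs and separator are hypothetical, not the repository's actual format:

```python
def make_conditioning(prompt_phonemes, target_phonemes, prompt_codes, sep=-1):
    """Toy VALL-E-style AR conditioning: the full phoneme sequence, a
    separator, then the prompt's acoustic codec tokens. An autoregressive
    model would continue generating acoustic tokens from this point,
    inheriting the prompt speaker's voice."""
    return prompt_phonemes + target_phonemes + [sep] + prompt_codes

# Hypothetical integer token IDs, for illustration only.
seq = make_conditioning([11, 12], [21, 22, 23], [901, 902])
# → [11, 12, 21, 22, 23, -1, 901, 902]
```

The key point is that the prompt's acoustic tokens sit immediately before the generated continuation, so the decoder's context carries the target speaker's timbre without any fine-tuning.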
## StreamSpeech
StreamSpeech is a unified model for both offline and simultaneous (streaming) speech translation, efficiently combining streaming ASR, speech-to-text translation, and speech-to-speech translation. By emitting intermediate results in real time, it enables low-latency communication. It supports eight tasks and ships with a web GUI demo for hands-on experience, making it a practical choice for developers who want state-of-the-art performance in real-time audio processing.
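Simultaneous translation's low latency comes from a read/write policy that decides how much source audio to consume before emitting each output unit. As a rough illustration of the idea (a generic wait-k schedule, not StreamSpeech's own alignment-based policy):

```python
def wait_k_schedule(num_src_chunks, num_tgt_tokens, k):
    """Toy wait-k policy: for each target step t, return how many source
    chunks must have been read before the t-th output may be emitted."""
    return [min(num_src_chunks, k + t) for t in range(num_tgt_tokens)]

# With k=3 the model first reads 3 chunks, then alternates read/write,
# capped at the total number of available source chunks.
schedule = wait_k_schedule(num_src_chunks=6, num_tgt_tokens=8, k=3)
# → [3, 4, 5, 6, 6, 6, 6, 6]
```

Smaller `k` lowers latency at the cost of translating with less source context; a unified model like StreamSpeech can serve both the streaming and the offline (read-everything-first) regimes.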
## Chinese-FastSpeech2
This project builds on an improved FastSpeech2 model for Chinese speech synthesis, focusing on expressive, rhythmic pronunciation through enhanced prosody representation and prediction. Recent updates add prosody-model training code and data preprocessing for the Biaobei dataset. The architecture combines FastSpeech2 and HiFi-GAN, using a prosody vector to link three models: fastspeech_model, hifigan_model, and prosody_model. Text-to-speech prediction is available from both the command line and an API, and community input and feedback are welcome.
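The three-model pipeline can be pictured as stages that pass a prosody vector from text analysis into the acoustic model and then into the vocoder. The sketch below uses random stand-ins for the learned models; the shapes, the fixed frames-per-phoneme expansion, and the function signatures are illustrative assumptions, not the project's real interfaces:

```python
import numpy as np

rng = np.random.default_rng(0)

def prosody_model(phonemes):
    # Toy stand-in: one 4-dim prosody vector per phoneme (real model is learned)
    return rng.standard_normal((len(phonemes), 4))

def fastspeech_model(phonemes, prosody, frames_per_phoneme=5, n_mels=80):
    # Toy stand-in: acoustic model expands each phoneme into mel frames,
    # conditioned on the prosody vectors
    return rng.standard_normal((len(phonemes) * frames_per_phoneme, n_mels))

def hifigan_model(mel, hop=256):
    # Toy stand-in: vocoder turns mel frames into a waveform (hop samples/frame)
    return rng.standard_normal(mel.shape[0] * hop)

phonemes = ["n", "i", "h", "ao"]          # "你好", toy phoneme split
mel = fastspeech_model(phonemes, prosody_model(phonemes))
wav = hifigan_model(mel)                  # mel: (20, 80), wav: (5120,)
```

The point of the structure is that prosody is predicted separately and injected into the acoustic model, so prosody can be trained and improved independently of the spectrogram and waveform stages.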
## BigVGAN
BigVGAN is a universal neural vocoder trained on large and varied audio datasets. It offers fast inference through custom CUDA kernels and supports sampling rates up to 44 kHz for high-quality output. Multi-scale sub-band CQT discriminators and a multi-scale mel spectrogram loss improve audio fidelity and reduce perceptual distortions, making it a strong choice for professionals in audio processing and synthesis.
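A multi-scale spectrogram loss compares reference and generated audio at several FFT resolutions, so that both fine temporal detail and coarse spectral structure are penalized. A minimal NumPy sketch of the idea (plain magnitude-STFT L1 rather than BigVGAN's actual mel formulation, with hypothetical FFT sizes):

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via Hann-windowed frames (toy, no padding)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_scale_spec_loss(ref, gen, fft_sizes=(64, 128, 256)):
    """Average L1 distance between magnitude spectrograms at several
    resolutions; small FFTs capture timing, large FFTs capture pitch/harmonics."""
    losses = []
    for n_fft in fft_sizes:
        hop = n_fft // 4
        a, b = stft_mag(ref, n_fft, hop), stft_mag(gen, n_fft, hop)
        losses.append(np.mean(np.abs(a - b)))
    return float(np.mean(losses))

x = np.sin(np.linspace(0.0, 50.0, 1024))
print(multi_scale_spec_loss(x, x))             # identical signals → 0.0
print(multi_scale_spec_loss(x, np.zeros(1024)) > 0)  # mismatch → positive
```

Summing the loss over several window sizes is what prevents the vocoder from optimizing one time-frequency trade-off at the expense of the others.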