WhisperSpeech: A New Era in Text-to-Speech Technology
WhisperSpeech is an innovative open-source text-to-speech system that transforms written text into spoken words by inverting the Whisper model. Previously known as spear-tts-pytorch, WhisperSpeech aims to be as impactful for speech as Stable Diffusion is for images: powerful yet easily customizable.
Key Features
- Open Source and Commercial Use Friendly: All speech recordings used for training are properly licensed, and the code base is open source, making the system safe to use in commercial applications.
- Language Versatility: Initially trained on the English LibriLight dataset, the project aspires to expand to multiple languages using Whisper and EnCodec's multilanguage capabilities.
Recent Progress
Multilanguage Voice Cloning
As of early 2024, WhisperSpeech has successfully trained a tiny S2A model capable of voice cloning in French using a combined English, Polish, and French dataset. Notably, the semantic token model behind it was trained only on English and Polish, suggesting that with further development a single semantic token model could support many of the world's languages, even those not currently supported by Whisper.
Performance Optimization
Significant improvements in inference performance have been made, incorporating torch.compile, kv-caching, and other optimizations. This has resulted in WhisperSpeech running over 12 times faster than real-time on a consumer-grade 4090 GPU. Additionally, users can now mix languages within a single sentence, and voice-cloning features have been made easier to test.
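For readers curious how these two optimizations typically combine, here is a minimal, self-contained sketch (not WhisperSpeech's actual code) of the general pattern: a preallocated key/value cache so each generation step only attends over the stored prefix, wrapped in torch.compile to cut Python overhead from the hot loop. All class and variable names are illustrative.

```python
# Illustrative sketch (not WhisperSpeech's implementation): torch.compile plus a
# key/value cache for autoregressive token generation in PyTorch.
import torch

class TinyDecoder(torch.nn.Module):
    """A stand-in single-layer decoder with a preallocated KV cache."""
    def __init__(self, dim=256, max_len=1024):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = torch.nn.Linear(dim, dim)
        # Preallocated cache avoids re-encoding past tokens at every step.
        self.register_buffer("k_cache", torch.zeros(1, max_len, dim))
        self.register_buffer("v_cache", torch.zeros(1, max_len, dim))

    def forward(self, x, pos):
        # Write the new token's key/value into the cache...
        self.k_cache[:, pos] = x[:, 0]
        self.v_cache[:, pos] = x[:, 0]
        # ...and attend only over the cached prefix instead of the full sequence.
        k = self.k_cache[:, : pos + 1]
        v = self.v_cache[:, : pos + 1]
        out, _ = self.attn(x, k, v)
        return self.proj(out)

decoder = TinyDecoder()
# torch.compile removes Python overhead and fuses kernels in the generation loop.
step = torch.compile(decoder)

x = torch.randn(1, 1, 256)
for pos in range(16):  # autoregressive generation loop
    x = step(x, pos)
```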
Smarter S2A Model
A new SD S2A model has been released that delivers faster processing while maintaining high speech quality. This update also includes an example of voice cloning using a reference audio file.
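As a rough illustration of the reference-audio workflow, the sketch below uses the whisperspeech package's Pipeline. The model name mirrors the one shown in the project README, while the speaker argument and file names are assumptions based on the example notebooks rather than a guaranteed API.

```python
# Hedged sketch of voice cloning from a reference recording with whisperspeech.
# The `speaker` argument and file paths are assumptions, not a confirmed API.
from whisperspeech.pipeline import Pipeline

# Load the small English+Polish S2A checkpoint referenced in the README.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Synthesize speech that (ideally) matches the reference speaker's voice.
pipe.generate_to_file(
    'cloned.wav',
    'This sentence should sound like the reference speaker.',
    speaker='reference_speaker.wav',  # assumed: path to a short, clean sample
)
```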
How to Get Involved
WhisperSpeech welcomes community participation. You can experiment with the system on Google Colab, download models from HuggingFace, or test them out using the provided notebook locally. The project roadmap includes gathering a larger emotive speech dataset, conditioning generation on emotions and prosody, and creating a community effort to collect freely licensed speech in multiple languages.
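For a quick local test outside the notebook, something along these lines should work, assuming `pip install whisperspeech`, a recent GPU, and that the default checkpoints are fetched from HuggingFace on first use (an assumption based on the project README):

```python
# Minimal local quickstart sketch; default model selection is assumed behaviour.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # loads the default T2S and S2A checkpoints
pipe.generate_to_file('hello.wav', 'WhisperSpeech turns text into natural speech.')
```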
System Architecture
WhisperSpeech's architecture draws inspiration from Google's AudioLM, SPEAR TTS, and Meta's MusicGen, leveraging open-source models (a conceptual sketch of the resulting three-stage pipeline follows the list):
- Whisper: Its encoder block is used to produce embeddings that are quantized into semantic tokens.
- EnCodec: A neural audio codec whose acoustic tokens are modeled and then decoded back into an audio waveform.
- Vocos: Integrated as a high-quality vocoder for enhanced audio synthesis.
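The following conceptual sketch shows how these three stages chain together: text to semantic tokens, semantic tokens to acoustic tokens, acoustic tokens to audio. The stage functions are hypothetical placeholders that only illustrate the data flow; token shapes and names are assumptions, not the real WhisperSpeech API.

```python
# Conceptual sketch of the three-stage generation flow described above.
# All functions are placeholders; shapes and ratios are illustrative only.
import numpy as np

def text_to_semantic(text: str) -> np.ndarray:
    """T2S stage: map text to Whisper-derived semantic tokens."""
    return np.zeros(len(text.split()) * 4, dtype=np.int64)       # placeholder tokens

def semantic_to_acoustic(semantic: np.ndarray) -> np.ndarray:
    """S2A stage: map semantic tokens to EnCodec acoustic tokens."""
    return np.zeros((4, semantic.shape[0] * 3), dtype=np.int64)   # placeholder codebooks

def acoustic_to_waveform(acoustic: np.ndarray) -> np.ndarray:
    """Vocoding stage: Vocos turns acoustic tokens into an audio waveform."""
    return np.zeros(acoustic.shape[1] * 320, dtype=np.float32)    # placeholder samples

semantic = text_to_semantic("hello world")
acoustic = semantic_to_acoustic(semantic)
audio = acoustic_to_waveform(acoustic)
```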
Acknowledgments
The project is a collaborative effort supported by Collabora and LAION, with computing assistance from the Jülich Supercomputing Centre. Contributions from individuals and organizations have been instrumental in the continuous development of this project.
Consulting and Contact
Collabora is available to assist with both open-source and proprietary AI projects. For inquiries or collaboration opportunities, interested parties can reach out through the Collabora website or connect via Discord.
WhisperSpeech is an exciting venture in the realm of speech technology, breaking new ground with its open-source, customizable approach to text-to-speech conversion. The dedication to expanding language support and improving performance showcases the project's commitment to innovation and inclusivity.