# Text-to-Speech

## IMS-Toucan
IMS Toucan is a toolkit for multilingual text-to-speech synthesis supporting over 7,000 languages. Created at the Institute for Natural Language Processing, University of Stuttgart, it is designed to be fast, controllable, and trainable with minimal computing power. Demos and a comprehensive multilingual TTS dataset are freely available on Hugging Face, and easy-to-follow installation instructions cover Linux, Windows, and Mac, with pretrained models available to speed up both training and inference.
## TTS
Explore a versatile Text-to-Speech library focused on ease of training, speed, and quality, with pre-trained models covering over 20 languages. It offers tools for dataset analysis and multi-speaker TTS, trains models efficiently with modern deep learning techniques, and implements key models such as Tacotron and Glow-TTS in PyTorch, TensorFlow, and TFLite, all backed by extensive documentation and community support.
## speech-recognition-uk
Discover a vast collection of resources for Ukrainian speech recognition and synthesis, featuring a range of models, datasets, and tools. This community-driven repository offers Speech-to-Text implementations such as wav2vec2 and Citrinet, along with performance benchmarks. Access a variety of datasets compiled from open sources and the community, providing essential materials for research and development in speech technology.
## aspeak
Discover aspeak, a text-to-speech client for Azure's TTS API, written in Rust with a Python binding. It communicates with the API over both REST and WebSocket and can be used from the command line or from Python scripts. Available from GitHub, AUR, and PyPI, it supports multiple output formats and advanced features such as customizable profiles, proxy configuration, and a wide range of voice options, making it suitable for developers who need solid configuration and authentication capabilities.
## AIUI
AIUI provides a voice interface for seamless interaction with GPT models in both desktop and mobile browsers. It supports natural, continuous conversation with AI and offers features like local hosting via Docker. Users can adjust language settings and choose between AI models like GPT-4 and GPT-3.5.
## vits2
VITS2 advances single-stage text-to-speech synthesis by enhancing speech naturalness and computational efficiency through improved architectures and training methodologies, while reducing phoneme conversion dependence. Designed for researchers and developers, VITS2 offers multi-speaker support and end-to-end processing, paving the way for future TTS technology. Explore the demo and documentation for more insights.
## vits2_pytorch
Explore VITS2, an innovative single-stage text-to-speech model that enhances naturalness and efficiency through advanced adversarial learning and architecture design. This implementation reduces phoneme conversion dependency, supports multi-speaker synthesis, and facilitates end-to-end training. Ideal for researchers and developers looking for efficient and modern TTS solutions with transfer learning capabilities.
## ttslearn
Discover a Python library tailored for text-to-speech synthesis with a primary focus on Japanese. While the main application is for Japanese, neural network features may also be used for other languages. Easily install using 'pip install ttslearn'. The package includes Jupyter notebooks for learning and advanced TTS recipes using JSUT and JVS corpora. Licensed under MIT, it's suitable for both commercial and non-commercial use. Access detailed documentation for further insights.
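As a quick sanity check of the library, the snippet below follows the quick-start pattern from the project's documentation; treat the exact class name and defaults as assumptions to verify against the current docs.

```python
# Assumed quick-start API: a pretrained Japanese DNN-TTS engine.
from ttslearn.dnntts import DNNTTS
from scipy.io import wavfile

engine = DNNTTS()                      # loads a pretrained model
wav, sr = engine.tts("ここに合成したいテキストを書きます。")
wavfile.write("output.wav", sr, wav)   # save the waveform
```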
## coqui-ai-TTS
Discover the capabilities of a leading Text-to-Speech library supporting 16 languages and delivering efficient performance with latency below 200ms. The library includes models such as Tacotron, Glow-TTS, and VITS, with options for fine-tuning and multi-speaker TTS support. Utilize over 1100 Fairseq models for various linguistic needs and access numerous tools for training and refining speech models. Designed for a diverse range of applications, this library offers developers a flexible solution for generating high-quality speech.
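A minimal sketch of the library's Python API: load a pretrained model by name and write synthesized speech to a WAV file. The model name below is one of the library's published LJSpeech models; `TTS.list_models()` shows what is currently available.

```python
from TTS.api import TTS

# Load a pretrained single-speaker English model by its registry name.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize a sentence and write it to disk.
tts.tts_to_file(text="Hello from the TTS library.", file_path="output.wav")
```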
## Cognitive-Speech-TTS
Cognitive-Speech-TTS leverages Azure's text-to-speech technology to assist developers in creating applications with realistic AI voices in multiple languages. The Speech SDK ensures robust integration across platforms, complemented by REST API samples. Developers can engage with the community on Discord for support and feedback. Significant use cases include AT&T's 5G projects, Duolingo's educational tools, and Progressive's chatbot advancements. Regular updates enhance voice options and AI features, maintaining Azure's leadership in various industries, including healthcare and automotive.
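For reference, the basic speak-to-default-speaker pattern from the Speech SDK for Python looks like this (the credentials and the voice name are placeholders):

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# With no audio config given, output goes to the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello from Azure text-to-speech.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis finished.")
```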
## fish-speech
Explore a robust text-to-speech system offering zero-shot and few-shot functionalities across languages like English, Japanese, and Chinese. The platform supports fast processing with a real-time factor of 1:5 on an Nvidia RTX 4060 and maintains low character and word error rates. Features include a Gradio-based web UI and a PyQt6 interface for easy cross-platform deployment on Windows, Linux, and macOS, enhanced by fish-tech acceleration.
## NATSpeech
NATSpeech offers a scalable, non-autoregressive text-to-speech synthesis framework, designed with user-friendly PyTorch implementation. It supports high-quality models like PortaSpeech and DiffSinger and features data processing with Montreal Forced Aligner. The efficient approach ensures resource-effective training and inference. The framework promotes ethical use by restricting unauthorized speech synthesis of individuals. Discover the advanced capabilities of NATSpeech in next-gen speech synthesis.
## epub_to_audiobook
The tool converts EPUB files into audiobooks using Text-to-Speech APIs such as Microsoft Azure and OpenAI. Designed for integration with Audiobookshelf, it creates an MP3 for each chapter with the chapter title as metadata. Compatible with Python or Docker setups, it supports voice and language customization. Features include preview mode and search-and-replace functionality, catering to both technical and non-technical users.
## DiffGAN-TTS
Discover the PyTorch implementation of high-quality, efficient text-to-speech synthesis with denoising diffusion GANs. The architecture supports single- and multi-speaker synthesis on datasets such as LJSpeech and VCTK and allows control over pitch, volume, and speaking rate. Its two-stage setup uses pre-trained FastSpeech2 models to support both naive and shallow diffusion training, and TensorBoard integration facilitates audio analysis and performance monitoring.
## read-aloud
Read Aloud is a browser extension for Chrome and Firefox that transforms webpage text into speech. It is beneficial for auditory learners, individuals with reading challenges, or those seeking to rest their eyes from digital screens. The extension offers a choice of speech voices from native browser options and premium services like Google Wavenet and Amazon Polly. Users can customize voice, speed, and pitch for a tailored auditory experience.
## TTS-Voice-Wizard
TTS Voice Wizard enhances VRChat with Speech-to-Text and Text-to-Speech, supporting translation across 50+ languages for seamless communication. It offers over 100 voice options and avatar controls via voice commands, and integrates extras such as Spotify live music display, battery tracking through XSOverlay, and heart rate monitoring via Pulsoid. Compatible with Windows 10 and 11, it is a valuable accessibility tool for virtual environments.
## Auto-YouTube-Shorts-Maker
This open-source project provides a free script that automates creating, editing, and voicing YouTube Shorts. Videos can be produced from AI-generated content or user input, and setup requires only basic tools: the OpenAI API, MoviePy, and gTTS. Once configured, the user supplies the video details and the script handles content generation, text-to-speech conversion, and video editing, saving the finished short to a specified directory. It currently lacks advanced features such as graphics or subtitles, but future enhancements may expand its functionality, making it a practical tool for creators who want to streamline production at no extra cost.
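The core of the pipeline is easy to picture in a few lines. The sketch below is illustrative, not the project's actual script: it generates a voiceover with gTTS and muxes it onto a background clip using the MoviePy 1.x API, and the file names are hypothetical.

```python
from gtts import gTTS
from moviepy.editor import VideoFileClip, AudioFileClip

# Synthesize the narration to an MP3 file.
gTTS("Your short's narration goes here.").save("voiceover.mp3")

# Trim the background video to the narration length and attach the audio.
narration = AudioFileClip("voiceover.mp3")
clip = VideoFileClip("background.mp4").subclip(0, narration.duration)
clip.set_audio(narration).write_videofile("short.mp4")
```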
## UniCATS-CTX-vec2wav
CTX-vec2wav is a vocoder from the AAAI-2024 paper 'UniCATS: A Unified Context-Aware Text-to-Speech Framework,' offering an advanced approach to text-to-speech enhancement through contextual VQ-diffusion and vocoding. Compatible with Linux and optimized for Python 3.9, this project provides clear guidance for both inference and training, suitable for various datasets and conditions. It supports high-fidelity output at 16kHz and 24kHz, utilizing resources such as ESPnet, Kaldi, and ParallelWaveGAN, and offers pre-trained models to advance speech synthesis development.
## EDDI
EDDI serves as an advanced companion application for Elite: Dangerous, offering real-time responses to events sourced from the game and third-party tools. Seamlessly integrating with VoiceAttack, it provides spoken responses and enhanced interaction with game mechanics. EDDI's monitoring system triggers functions such as system information updates and flight log recordings. The application is user-friendly, capable of standalone installation or running as a VoiceAttack plugin, and offers smooth upgrades from previous versions along with extensive voice customization. For more support, see the troubleshooting page.
## tiktok-tts
The tiktok-tts project generates TikTok's Text-to-Speech voices directly in the browser. Its straightforward interface makes it easy to produce high-quality audio tailored to TikTok, helping creators build dynamic and accessible video content.
## GPT-SoVITS
Discover a comprehensive platform for efficient voice conversion and multilingual text-to-speech powered by a user-friendly interface. Access zero-shot and few-shot speech synthesis across multiple languages, including English, Japanese, Korean, Cantonese, and Chinese, with built-in tools for dataset preparation and text labeling. Easily deployable through Colab, Docker, and direct downloads, ensuring support for Windows, Linux, and macOS environments. Achieve realistic and flexible voice results with GPT-SoVITS-WebUI.
## lobe-tts
Lobe TTS provides a versatile library for Text-to-Speech and Speech-to-Text, optimized for server and browser environments. Built with TypeScript, it supports efficient voice synthesis and works with EdgeSpeechTTS, MicrosoftTTS, OpenAITTS, and OpenAISTT. The toolkit includes React Hooks and visual audio components, facilitating integration and enhancing audio playback functionality in web applications. As an open-source solution, it assists developers in building applications with comprehensive audio features.
## StyleTTS
Explore an innovative solution addressing text-to-speech synthesis challenges, emphasizing natural prosodic variations and diverse speaking styles. The style-based generative model incorporates the novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation to surpass state-of-the-art performance. It facilitates self-supervised learning of speaking styles, enabling the generation of varied speech with precise prosody and emotional tones without explicit categorization. This advanced TTS model enhances naturalness and similarity across single and multi-speaker datasets, promoting efficient speech synthesis.
## ChatTTS-ui
ChatTTS-ui provides a straightforward local web UI and API for text-to-speech conversion. It accommodates multilingual text and numeric integration for flexible voice synthesis. Compatible with both Windows and Linux, it offers deployment via pre-packaged or source versions. GPU acceleration is supported on NVIDIA cards, enabling efficient API usage. Features include streamlined installation, model management, and cross-device support, catering to different computational capabilities.
## PL-BERT
The PL-BERT project improves text-to-speech by pre-training a phoneme-level BERT to predict masked phonemes together with their corresponding graphemes, significantly improving speech naturalness over models such as StyleTTS. Pre-trained on a large English dataset, it is adaptable to other languages and integrates easily into various TTS models, with comprehensive setup and training guides. The method efficiently generates prosodic patterns from phoneme-only inputs.
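To make the training idea concrete, here is a hypothetical PyTorch sketch of the joint objective described above: a shared Transformer encoder over phoneme tokens feeds two classification heads, one recovering masked phonemes and one predicting the corresponding graphemes. All dimensions and names are illustrative, not the paper's actual code.

```python
import torch
import torch.nn as nn

class PhonemeLevelBERT(nn.Module):
    """Hypothetical sketch: shared encoder with phoneme and grapheme heads."""

    def __init__(self, n_phonemes, n_graphemes, d_model=512, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)    # masked phoneme prediction
        self.grapheme_head = nn.Linear(d_model, n_graphemes)  # grapheme prediction

    def forward(self, phoneme_ids):
        h = self.encoder(self.embed(phoneme_ids))
        return self.phoneme_head(h), self.grapheme_head(h)

model = PhonemeLevelBERT(n_phonemes=100, n_graphemes=5000)
ids = torch.randint(0, 100, (2, 32))          # a batch of (masked) phoneme ids
phon_logits, graph_logits = model(ids)
# Summing cross-entropy losses over the masked positions for both heads
# gives the joint pre-training objective.
```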
## PortaSpeech
PortaSpeech delivers a PyTorch-based generative text-to-speech system known for its compact model size and flexibility. It allows exploration of audio samples and employs pretrained models for single and batch inference. Featuring TTS controllability and supporting datasets like LJSpeech, it comes with concise preprocessing and training guidance and integrates vocoder options via HiFi-GAN and MelGAN for quality synthesis, making it a versatile choice for developers interested in speech synthesis. It also accommodates custom datasets, offers enhanced alignment configuration, and exposes training progress through TensorBoard.
## tetos
TeToS offers a streamlined Python library for integrating multiple Text-to-Speech (TTS) providers, including Google, Azure, and OpenAI, allowing easy customization of output across providers, languages, and voices via command line or API. Installation requires Python 3.8 or newer. The library accommodates proxy settings, enhancing its utility across different network setups, and SSML support is planned. It is released under the Apache License 2.0.
## wetts
WeTTS provides a comprehensive end-to-end text-to-speech toolkit designed for robust production use. It leverages advanced models like VITS and integrates WeTextProcessing for effective text normalization and prosody control. Supporting multiple open-source datasets such as Baker and AISHELL-3, WeTTS is compatible with a wide range of hardware including x86 and Android, offering developers a reliable solution for developing high-quality TTS applications.
## ChatTTS_colab
This project provides a user-friendly text-to-speech solution with simple deployment that avoids complex setups. It supports streaming audio output and long audio generation, featuring a voice sampling function to create and store preferred voice tones. The project facilitates role-specific narration and is operable with one click via Colab in a browser. It also allows exploration of voice tone libraries categorized by gender and age, offering a versatile toolset for speech synthesis.
## audio-ai-timeline
Explore a detailed compilation of recent advancements in AI models for audio generation. This repository highlights innovative projects like Mustango and Music ControlNet, providing resources including sample releases, research papers, and code links. A valuable tool for researchers and developers keen on cutting-edge audio technology and AI integration in sound production.
## vits
Discover an innovative end-to-end TTS method that improves upon traditional two-stage systems using variational inference and adversarial learning. This approach enhances generative capabilities, resulting in natural-sounding speech. A stochastic duration predictor supports varied speech rhythms and tones from text. Human evaluations on the LJ Speech dataset demonstrate its superior performance, achieving MOS scores close to real human speech. Access the interactive demo for audio examples or explore available pretrained models.
## VoiceFlow-TTS
VoiceFlow uses rectified flow matching to improve the efficiency and quality of text-to-speech synthesis. This ICASSP 2024 paper comes with a detailed implementation guide covering environment setup, data preparation, training, and inference. The repository provides utility scripts and model configurations for customization across datasets, plus experimental features such as voice conversion and likelihood estimation that broaden the reach of flow matching in speech synthesis. Aimed at developers looking for efficient TTS solutions.
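For intuition about the underlying technique, here is a generic rectified-flow sketch in PyTorch (an illustration of the method, not the repository's code): the network `v_net(x, t)` is trained to predict the straight-line velocity between a noise sample and a data sample, and generation integrates that velocity field with Euler steps.

```python
import torch

def rectified_flow_loss(v_net, x1):
    """x1: a batch of data, shape (B, D); v_net: callable (x, t) -> velocity."""
    x0 = torch.randn_like(x1)              # noise endpoint
    t = torch.rand(x1.size(0), 1)          # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1             # straight-line interpolant
    target = x1 - x0                       # constant velocity along the line
    return ((v_net(xt, t) - target) ** 2).mean()

@torch.no_grad()
def sample(v_net, shape, steps=10):
    x = torch.randn(shape)                 # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1), i * dt)
        x = x + v_net(x, t) * dt           # Euler step along the learned ODE
    return x
```

Because the target trajectories are straight lines, a well-trained model can sample with very few integration steps, which is where the efficiency gains come from.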
## mimic-recording-studio
Mimic Recording Studio optimizes the collection of speech training data for Mycroft's open-source text-to-speech systems, helping to generate diverse voice outputs. Compatible with Windows, Linux, and Mac, it offers both Docker and manual installation. The intuitive interface facilitates recording, trimming, and organizing WAV audio files, aiming for the consistent, high-quality recordings essential for good TTS output. Built on React and Flask, the platform allows corpus customization across multiple languages and supports multiple users through an integrated SQLite database.
## gTTS
gTTS lets Python users drive Google Translate's text-to-speech from simple CLI commands or as a library. It generates MP3 files with customizable sentence tokenization and pronunciation handling for natural-sounding output, and is pip-installable with no Google Cloud account required. For comprehensive usage, refer to the extensive documentation and join the community discussions.
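Basic usage is a few lines; the snippet below writes a sentence to an MP3 file.

```python
from gtts import gTTS

tts = gTTS("Hello, world!", lang="en", slow=False)
tts.save("hello.mp3")
```

The same operation is available from the shell through the bundled `gtts-cli` command.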
## glow-tts
Glow-TTS uses a flow-based model for fast and parallel text-to-speech generation without external aligners. By using monotonic alignment search, it produces quick and varied speech with high quality, outperforming older models like Tacotron 2 in speed. It supports multi-speaker scenarios and long utterances with modifications like HiFi-GAN integration and blank tokens enhancing quality. Check out the demo and access pretrained models for use.
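Monotonic alignment search is the dynamic-programming step at the heart of Glow-TTS. The sketch below is a simplified NumPy version for illustration; the repository ships its own optimized implementation.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Most likely monotonic alignment between text tokens and mel frames.

    log_p: array of shape (T_text, T_mel) holding the log-likelihood of each
    mel frame under each text token. Returns a 0/1 path matrix.
    """
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)    # best cumulative log-likelihood
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):  # at most j+1 tokens fit j+1 frames
            stay = Q[i, j - 1]               # keep emitting the same token
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # move to next token
            Q[i, j] = log_p[i, j] + max(stay, advance)
    # Backtrack from the final token/frame pair to recover the path.
    path = np.zeros_like(Q, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        path[i, j] = 1
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return path
```

Because the search needs only the likelihood matrix, no external aligner or ground-truth durations are required, which is what enables Glow-TTS's parallel training.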
## GenerSpeech
Discover GenerSpeech, a text-to-speech model designed for high-fidelity zero-shot style transfer with out-of-domain voices. Featuring advanced multi-level style transfer and generalization capabilities, it supports various datasets with available pretrained models. Access guidelines for seamless inference and data preparation to efficiently implement and train models, all while adhering to ethical standards.
## nix-tts
Discover Nix-TTS, a cutting-edge lightweight text-to-speech system employing modular distillation for improved naturalness and clarity. With an impressive 89.34% reduction in model size, Nix-TTS ensures fast inference across devices while maintaining quality. Follow the straightforward setup instructions to implement Nix-TTS and experience real-time speech synthesis with our pretrained models. Developed by the Kata.ai research team, this approach meets modern TTS needs for efficiency and effectiveness.
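The distillation idea can be illustrated generically (this is a plain knowledge-distillation sketch, not Nix-TTS's exact recipe): a small student network is trained to match the outputs of a frozen, larger teacher.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for a large teacher module and a small student.
teacher = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 80)).eval()
student = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 80))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
mel = torch.randn(8, 80)                   # a batch of mel-spectrogram frames

with torch.no_grad():
    target = teacher(mel)                  # frozen teacher output
optimizer.zero_grad()
loss = nn.functional.mse_loss(student(mel), target)
loss.backward()
optimizer.step()
```

Applying this per module (for example, to the encoder and decoder separately) rather than to the whole network at once reflects the modular aspect of the approach.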
## voicebox-pytorch
This repository provides an implementation of the MetaAI Voicebox model in Pytorch for advanced text-to-speech applications. It features rotary embeddings and adaptive normalization, techniques inspired by successful AI audio projects like Paella. It includes installation instructions, usage examples, and regular updates. This project, supported by community contributions, aims to broaden access to high-quality open-source AI models for academic and commercial use.
## Bridge-TTS
Bridge-TTS uses a Schrödinger bridge approach to enhance text-to-speech synthesis, outperforming diffusion models across multiple settings. It offers precise and efficient synthesis tailored to TTS tasks. For detailed insights, visit the project page and paper. The code will be published once the paper is accepted.
## Awesome-Diffusion-Transformers
This compilation surveys diffusion transformers across text, speech, and video generation. It highlights groundbreaking research such as text-driven motion generation and scalable image synthesis, covering methodologies like transformer-based denoising, high-resolution image synthesis, and efficient training techniques. Featuring works like MotionDiffuse and scalable diffusion models, it offers researchers and practitioners a comprehensive overview of innovations in diffusion transformers, with links to sample releases, papers, and code.