# text-to-speech

## ChatTTS
ChatTTS is an advanced TTS model optimized for natural dialogue, offering multi-speaker synthesis and fine-grained prosody control. Trained on over 100,000 hours of multilingual data, it surpasses many open-source models in quality. Pretrained models are available for educational and research use, enabling straightforward integration into AI systems; the project also documents its features and ethical guidelines.
## StyleSpeech
Meta-StyleSpeech is a cutting-edge text-to-speech model that generates personalized, high-quality speech from minimal input. Using Style-Adaptive Layer Normalization, it adapts precisely to a speaker's style from a single short audio clip. With enhancements such as style prototypes and episodic training, it achieves strong speaker adaptation without extensive fine-tuning; pre-trained models and detailed setup guidance make it practical for a range of applications.
## dc_tts
The dc_tts project introduces a text-to-speech model that employs deep convolutional networks with guided attention, emphasizing efficient training and quality synthesis. The project examines diverse datasets such as LJ Speech and KSS, incorporating techniques like layer normalization and adaptive learning rates to improve performance. Training scripts are available for users to generate and evaluate synthetic speech, aiming for greater efficiency over Tacotron through exclusive use of convolutional layers.
## chatgpt-telegram-bot
The project integrates Telegram with OpenAI technologies such as ChatGPT, DALL·E, and Whisper, enabling intelligent responses and multimedia handling. Key features include markdown support, image generation, audio transcription, and user management through budget tracking and advanced language options. Users can access recent AI models such as GPT-4 Turbo and extend functionality with plugins for services like weather and Spotify. With Docker compatibility, the bot is straightforward to deploy and use.
## epub2tts
The tool facilitates converting EPUB or text files into audiobooks by leveraging advanced text-to-speech technologies from Coqui AI, OpenAI, and MS Edge. Features include chapter detection, cover art embedding, and voice customization with 58 studio-quality options. It supports multiprocessing for enhanced speed and offers cloud-based MS Edge TTS for convenience. Users can also utilize Coqui's XTTS model for voice cloning, offering a personalized touch. A flexible solution for audiobook creators and developers.
## espeak-ng
eSpeak NG is a compact text-to-speech engine for Linux, Windows, and Android, supporting over 100 languages. It employs formant synthesis for clear speech and runs as a command-line tool, shared library, or SAPI5 interface on Windows. This fork of eSpeak expands language coverage and functionality while maintaining compatibility, and it supports WAV file output, SSML markup, and integration with MBROLA voices.
## Rodel.Agent
Discover a feature-rich Windows application that integrates chat, text-to-image, text-to-speech, and translation features. Supporting popular AI services, the tool ensures a superior desktop AI experience. Requiring Visual Studio 2022 and .NET 8, it allows for custom configurations with modular console programs. Comprehensive documentation guides users through setup and development, enabling full utilization of its diverse AI functionalities.
## XPhoneBERT
XPhoneBERT, a multilingual phoneme model, optimizes text-to-speech (TTS) technology by refining phoneme representations. With its BERT-base architecture trained on 330 million phoneme-level sentences from about 100 languages, it enhances TTS systems' naturalness and prosody, even with limited training data. Seamlessly integrating with Python's 'transformers' package and 'text2phonemesequence' for phoneme conversion, XPhoneBERT supports efficient multilingual pre-training.
## StableTTS
StableTTS is a state-of-the-art flow-matching TTS model built on a DiT backbone, supporting efficient speech generation in Chinese, English, and Japanese. The 31M-parameter model improves audio quality, adds classifier-free guidance (CFG), and supports the FireflyGAN vocoder, alongside an improved Chinese text frontend. The newly released version 1.1 introduces U-Net-style skip connections and a cosine timestep scheduler, all within a single multilingual checkpoint. Designed for user-friendly training, it simplifies data preparation and finetuning, making it adaptable to varied audio generation applications.
## elevenlabs-examples
Explore a variety of demonstrations and projects showcasing the ElevenLabs API, designed for developing advanced AI audio applications. Including features such as text-to-speech, multilingual dubbing, and custom sound effects, these resources offer valuable tools and insights. Perfect for developers integrating audio capabilities into React-based sites or using pronunciation dictionaries to enhance user experiences. Start by cloning the repository and following detailed project instructions, supported by extensive API documentation, to innovate audio solutions.
## chat-with-gpt
Chat with GPT offers a flexible chat interface, leveraging the ChatGPT API and ElevenLabs, enabling fast responses, realistic text-to-speech, and speech recognition. Users can modify the AI's System Prompt, set creativity levels, and benefit from complete markdown support. Features include easy session sharing and editing, with payment options based on usage for the API. OpenAI and ElevenLabs API keys are required to start, with all keys stored securely on the user's device.
## Comprehensive-Transformer-TTS
This Transformer TTS framework integrates both supervised and unsupervised modeling across various architectures, including Fastformer and Reformer. It is suitable for single and multi-speaker setups, offering prosody control, pitch, and volume adjustments. Compatibility with HiFi-GAN and MelGAN ensures superior audio quality. Comprehensive tools and easy-to-follow instructions make setup and operation seamless.
## KAN-TTS
KAN-TTS provides tools to develop custom text-to-speech models, supporting languages ranging from Mandarin and English to regional dialects such as Shanghainese and Cantonese. It employs models such as SAM-BERT and HiFi-GAN for speech synthesis. Training materials are available on the KAN-TTS Wiki, with test examples on ModelScope. The platform is expanding with more languages and models, and community engagement is available through DingTalk.
## klaam
Klaam offers advanced Arabic speech technology built on models such as Wav2Vec2 and FastSpeech2 for recognition, classification, and text-to-speech. It supports both Modern Standard Arabic and dialects such as Egyptian, leveraging datasets like MGB-3 and Common Voice. Comprehensive guides ease integration into projects, making it well suited to developers working on Arabic language processing.
## FunCodec
FunCodec is an open-source toolkit for neural speech codec applications, providing installation guides, pre-trained models, and comprehensive training protocols. It supports both general and custom datasets for efficient encoding and decoding. Models are available on Hugging Face and ModelScope, offering codec-based text-to-speech with strong semantic consistency and speaker similarity. The project integrates with frameworks such as FunASR, Kaldi, and ESPnet to streamline audio data management and processing for research and development.
## rvc-tts-webui
Discover a Gradio web interface designed for TTS using RVC models, which operates on CPUs for flexible use. The detailed guide provides installation steps, model configuration, and execution instructions, highlighting Python 3.10 compatibility and optional GPU support. Efficient management of RVC model directories and addressing non-ASCII path issues are outlined. The project offers updates and solutions for installation challenges, such as Microsoft C++ Build Tools. Access online demos and integrated voice conversion features with ease.
## ospeak
Ospeak is a CLI tool for converting text to speech using OpenAI's Text-to-Speech API, featuring customizable voice options and output formats like MP3 and WAV. It supports a variety of voice models and speeds, enabling both direct speech output in the terminal and audio file creation. An OpenAI API key is needed, and users on macOS should note specific dependency requirements. The tool is designed to facilitate easy inclusion of AI-powered speech capabilities in diverse applications.
## LibriTTS-P
LibriTTS-P offers an extensive set of annotations for text-to-speech and style captioning, encompassing both human-evaluated speaker traits and machine-generated speaking style prompts. This dataset provides a broader range of annotations for LibriTTS-R speakers, enhancing the naturalness and accuracy of models in TTS and style captioning domains. It also includes comprehensive metadata and style prompt options, facilitating the detailed study of spoken language attributes.
## RealtimeTTS
RealtimeTTS is a text-to-speech library designed for real-time applications, providing fast, high-quality audio conversion. It supports TTS engines including OpenAI, ElevenLabs, and Azure, offers multilingual capabilities, and includes a robust fallback mechanism for reliable performance. Custom installation options and high-quality speech generation make it suitable for professional environments. Its companion library, RealtimeSTT, adds speech-to-text for building complete real-time audio pipelines around large language models.
## AudioGPT
AudioGPT is an open-source initiative providing tools for analyzing and generating speech, music, and other audio. It supports tasks such as text-to-speech, style transfer, and speech recognition through models like FastSpeech and Whisper. For audio manipulation it includes text-to-audio and image-to-audio tasks using models such as Make-An-Audio, and it offers talking-head synthesis with GeneFace. While some features are still being refined, AudioGPT continues to broaden its functionality for varied audio projects.
## XZVoice
Discover a comprehensive guide to XZVoice, a text-to-speech software using Electron and Vue, featuring Alibaba Cloud's synthesis engine. Understand the process of setting up application keys and integrating music with Qiniu Cloud for enhanced functionality. Suitable for developers interested in customizable and open-source text-to-speech solutions.
## vits-simple-api
The VITS API provides text-to-speech and voice conversion with features like automatic language recognition, multi-model support, and GPU acceleration. It includes advanced models such as HuBERT-VITS and Bert-VITS2, and supports convenient deployment through Docker or virtual environments. A WebUI eases management, and the API supports SSML and customizable defaults, making it suitable for scalable applications.
## Multi-Tacotron-Voice-Cloning
The Multi-Tacotron Voice Cloning project is a multilingual phonemic implementation for Russian and English, built on a deep learning framework. The project, an extension of Real-Time-Voice-Cloning, facilitates the creation of numeric voice representations from brief audio samples. It includes pre-trained models and necessary datasets, providing efficient pathways for text-to-speech conversion. The diverse datasets and neural networks such as Tacotron 2 and WaveRNN enable seamless multilingual capabilities, suited for advanced TTS synthesis requirements.
## megatts2
Discover the unofficial implementation of Mega-TTS 2, which integrates advanced techniques for speech synthesis. The project targets a combined Chinese-English dataset of roughly 1,000 hours and uses BigVGAN to enhance audio quality. Through VQ-GAN, ADM, and PLM components, it aims to advance zero-shot TTS. Detailed guidance covers dataset preparation, model training with PyTorch Lightning, and inference testing. Released under the MIT license and backed by Simon from ZideAI, the project supports wide-ranging language adaptations.
## LangHelper
LangHelper provides a dynamic platform for AI-driven speech recognition and synthesis, featuring multi-accent communication and celebrity voice interaction. It includes advanced pronunciation assessment tools suitable for IELTS/TOEFL test preparation and supports integration with ChatGPT and espeak-ng for efficient setup on Windows. The project is engineered to offer interactive sessions through both custom and default text-to-speech functionalities, catering to accent training and pronunciation competence.
## elevenlabs-python
The Python library from ElevenLabs provides comprehensive text-to-speech capabilities for developers and content creators, offering vibrant, realistic voices across numerous languages and accents. Featuring advanced models such as Eleven Multilingual v2 and Eleven Turbo v2.5, the library balances quality, language coverage, and speed. Installation and integration are straightforward: users can generate audio, clone voices, and adjust settings to meet various project needs, making it suitable for anyone seeking professional-quality audio tools.
## TranslationPlugin
The plugin supports various translation engines, such as Google, Microsoft, and DeepL, integrated within IntelliJ-based IDEs like Android Studio. It features text-to-speech, document translation, and language conversion, enhancing productivity through automatic word selection and simple engine switching. Designed for efficient integration into development workflows, it serves developers requiring seamless translation capabilities.
## HierSpeechpp
HierSpeech++ employs hierarchical variational inference to advance zero-shot speech synthesis, enhancing robustness and expressiveness. It efficiently bridges semantic and acoustic gaps, significantly boosting naturalness and speaker similarity in TTS and voice conversion. This project includes a text-to-vec framework and a high-efficiency super-resolution process, enhancing audio from 16kHz to 48kHz. Built on PyTorch, it offers pre-trained models for further exploration, outperforming LLM-based and diffusion models in human-level quality synthesis.
## PaddleSpeech
PaddleSpeech provides a robust toolkit on the PaddlePaddle platform for speech recognition, speech translation, and text-to-speech synthesis. It features award-winning models recognized at NAACL 2022 and suits audio processing applications across multiple languages. The toolkit is regularly updated with state-of-the-art models and is designed for easy system integration by researchers and developers seeking accurate recognition, reliable synthesis, and efficient translation.
## WhisperSpeech
Explore an innovative open-source text-to-speech system designed for flexibility and commercial use. Currently supporting English, with multilingual compatibility planned, recent updates improve performance and introduce voice cloning. Its capabilities can be tested on Google Colab with models built on Whisper, EnCodec, and Vocos.
## voice-builder
Voice Builder is an open-source text-to-speech voice creation tool that aims to simplify experimentation and enhance TTS research. It is particularly useful for languages with limited resources, providing easy-to-follow steps for installation and deployment via Google Cloud and Docker. The tool facilitates voice training for users with basic computing skills, fostering interdisciplinary collaborations without unnecessary barriers.
## marytts
MaryTTS is a Java-based, open-source multilingual TTS platform. It allows easy integration into Java projects with Maven or Gradle, and supports other languages via HTTP server queries. Its installer GUI simplifies voice management. Community collaboration is encouraged with clear contribution guidelines.
## edge-tts
The edge-tts Python module integrates Microsoft Edge's online text-to-speech service into applications. It provides a command-line interface for generating and playing back speech, with voice selection and adjustments to rate, volume, and pitch. Installation is straightforward with pip or pipx. While custom SSML is not supported, the module improves application accessibility with its many voices and playback features, and the included instructions and examples make the speech synthesis features easy to implement.
## DaVinci
DaVinci supports GPT-4 and is optimized for the Raspberry Pi 4 running the legacy 64-bit OS; as of its December 5, 2023 release, the Bookworm OS has compatibility issues, so newer hardware such as the Raspberry Pi 5 may not work. Following an OpenAI change of March 26, 2024, API use requires pre-purchased credit. DaVinci now offers an alternative voice using OpenAI's text-to-speech, removing the need for Amazon Polly, and an Italian-speaking version expands language support while easing setup by reducing the AWS dependency.
## bark.cpp
The bark.cpp project delivers real-time, multilingual text-to-speech in pure C/C++, building on SunoAI's Bark model. It supports CPU and GPU inference, uses AVX instructions on x86 architectures, and offers mixed F16/F32 precision with several quantization options. Suitable for a range of hardware via the Metal and CUDA backends, it currently supports the Bark Small and Large models, and community contributions are welcomed to extend the model library and functionality.
## parler-tts
Parler-TTS is an open-source model for generating high-quality text-to-speech in different speaker styles. It provides complete access to datasets, training codes, and model weights under permissive licenses. The model supports rapid synthesis and is trained on extensive audiobook data, making it a suitable framework for researchers and developers. Parler-TTS allows for the customization of speech features through simple text prompts.
## sherpa-onnx
The repository provides local speech processing features such as speech-to-text and text-to-speech, as well as speaker identification. It supports x86, ARM, and RISC-V architectures on operating systems like Linux, macOS, Windows, Android, and iOS. API support is available in C++, Python, JavaScript, and more, with WebAssembly compatibility. This tool covers voice activity detection, keyword spotting, and audio tagging, making it suitable for various applications across different systems and hardware. Explore its capabilities on Huggingface spaces without installation.