ChatTTS - Dialogue-Focused TTS Model for Multi-Speaker Interactions

Introducing ChatTTS: A Generative Speech Model for Dialogue

ChatTTS is a text-to-speech (TTS) model designed for conversational scenarios, such as interactions with large language model (LLM) assistants. The project focuses on producing natural, expressive speech for dialogue-based applications.

Supported Languages

Currently, ChatTTS fully supports two languages:

English
Chinese

More languages are expected to be added in the future.

Key Features

Conversational TTS: ChatTTS is optimized specifically for dialogue. It handles multiple speakers and generates natural, expressive speech suited to interactive conversations.
Fine-grained Control: The model can predict and control fine-grained prosodic features, such as laughter, pauses, and interjections, giving users more control over the nuances of speech.
Enhanced Prosody: ChatTTS is reported to capture the rhythm and intonation of speech better than many other open-source TTS models. Pretrained models are available to the community for further exploration and development.
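
As a rough illustration of the fine-grained control described above, the sketch below assembles the kind of inline control tokens that such a model can read from its input text. The token names (`[oral_N]`, `[laugh_N]`, `[break_N]`, `[uv_break]`, `[laugh]`) and their ranges are assumptions drawn from the upstream project documentation, not from this article, and `build_refine_prompt` and `mark_pauses` are hypothetical helper names. Check both against the version you install.

```python
# Hypothetical helpers; token names and ranges are assumptions, not a confirmed API.
ASSUMED_RANGES = {"oral": (0, 9), "laugh": (0, 2), "break": (0, 7)}


def build_refine_prompt(oral=2, laugh=0, pause=6):
    """Return a sentence-level control prompt such as '[oral_2][laugh_0][break_6]'."""
    levels = {"oral": oral, "laugh": laugh, "break": pause}
    parts = []
    for name, level in levels.items():
        low, high = ASSUMED_RANGES[name]
        if not low <= level <= high:
            raise ValueError(f"{name} level {level} outside assumed range {low}-{high}")
        parts.append(f"[{name}_{level}]")
    return "".join(parts)


def mark_pauses(phrases, token="[uv_break]"):
    """Join phrases with an inline pause token to request word-level breaks."""
    return f" {token} ".join(p.strip() for p in phrases)


prompt = build_refine_prompt(oral=2, laugh=0, pause=6)
text = mark_pauses(["What is", "your favorite English food?"]) + " [laugh]"
print(prompt)
print(text)
```

Keeping the assumed ranges in one table means that a change in the supported token levels only needs to be corrected in a single place.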

Model and Dataset

ChatTTS is built primarily for academic and research purposes. The main model is trained on more than 100,000 hours of Chinese and English audio. The open-source version available on Hugging Face is pretrained on 40,000 hours of data.

Roadmap

Release and open-source the 40k-hour base model and speaker stats file.
Implement and open-source streaming audio generation.
Offer an open-source DVAE encoder and zero-shot inference code.
Future goals include controlling multiple emotions and expanding the repository.

Licensing

The project's code is under the AGPLv3+ license, while the model itself is licensed under CC BY-NC 4.0. Both the code and the model are intended strictly for educational and research use, with no commercial application allowed.

Ethical Use and Disclaimer

ChatTTS is a capable text-to-speech system, so it is crucial to use it responsibly and ethically. To limit misuse, measures such as adding a small amount of high-frequency noise during training have been taken. A detection model is also in development to further support responsible use.

Getting Started

To begin using ChatTTS, clone the repository from GitHub and set up the environment with pip or conda. Optional installations are available for more specific use cases, such as running on NVIDIA GPUs.
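
After installation, a quick check along the following lines can confirm that the environment is usable. This is only a sketch: it assumes the package is importable as `ChatTTS` and that `torch` and `torchaudio` are among its dependencies, neither of which is stated in this article, and `check_environment` is a hypothetical helper name.

```python
import importlib.util
import sys


def check_environment(packages=("ChatTTS", "torch", "torchaudio")):
    """Report the Python version and whether each top-level package is importable."""
    report = {"python": tuple(sys.version_info[:3])}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

A `False` entry suggests that the corresponding package is missing from the active pip or conda environment.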

Quick Start

Users can launch a WebUI for interactive use or use the command-line interface to generate audio files from text input.
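
For readers who prefer scripting to the WebUI or command line, the following is a minimal sketch of driving the model from Python. The calls `ChatTTS.Chat()`, `chat.load(compile=False)` and `chat.infer(texts)`, and the 24 kHz sample rate, are assumptions based on the upstream project documentation and may differ between releases. `save_wav` and `synthesize_to_files` are hypothetical helpers written for this example.

```python
import struct
import wave

SAMPLE_RATE = 24000  # assumed output rate of the released model


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    # Accept NumPy arrays (flattened) as well as plain Python sequences.
    flat = samples.ravel().tolist() if hasattr(samples, "ravel") else list(samples)
    # Clip to [-1, 1], then scale to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in flat]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return len(ints)


def synthesize_to_files(texts, prefix="output"):
    """Generate one WAV file per input text; requires the ChatTTS package."""
    import ChatTTS  # imported lazily so save_wav works without it installed

    chat = ChatTTS.Chat()
    chat.load(compile=False)  # assumed loader name; older releases may differ
    wavs = chat.infer(texts)  # assumed to return one waveform per input text
    paths = []
    for i, wav in enumerate(wavs):
        path = f"{prefix}{i}.wav"
        save_wav(wav, path)
        paths.append(path)
    return paths
```

Under these assumptions, calling `synthesize_to_files(["Hello there.", "Nice to meet you."])` would write one WAV file per sentence. Writing the audio with the standard-library `wave` module keeps the example free of extra audio dependencies.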

Contact and Community

Feedback and contributions are welcome through GitHub issues and pull requests. For formal inquiries and roadmap discussions, contact [email protected]. Community discussion also takes place in a QQ group and on Discord.

Conclusion

ChatTTS opens up new possibilities for interactive, natural-sounding speech in dialogue systems. By providing a sophisticated but accessible TTS platform, it lets developers and researchers explore the potential of conversational AI.