Introduction to LibriTTS-P
LibriTTS-P is a corpus designed to improve the naturalness of text-to-speech (TTS) systems and the accuracy of style captioning. It builds upon LibriTTS-R, adding prompts that describe speaking style at the utterance level and speaker characteristics at the speaker level. Its distinguishing feature is an annotation scheme that combines human (manual) and synthetic annotations.
Key Features of LibriTTS-P
- Diverse Prompt Annotations:
  - LibriTTS-P sets itself apart by providing a rich array of prompt annotations that describe not just the audio content but also its stylistic nuances and speaker identities.
  - Compared with existing English datasets, it offers more diverse annotations applicable to all the speakers contained in LibriTTS-R.
- Hybrid Annotation Approach:
  - The dataset employs two types of annotations: manual annotations, which capture the human perception of speaker characteristics, and synthetic annotations, which focus on speaking style.
  - This hybrid approach covers both perceived speaker identity and utterance-level style, giving TTS models a wider range of variation to train on.
- Improved TTS and Style Captioning:
  - Experiments demonstrate that TTS models trained using LibriTTS-P achieve greater naturalness compared to those built on conventional datasets.
  - In style captioning tasks, models using LibriTTS-P generate more accurate descriptive terms, enhancing the expressiveness and precision of automatically generated captions.
Dataset Structure
The LibriTTS-P dataset contains various files organized under a data directory:
- Speaker Prompt Data:
  - Files like df1_en.csv, df2_en.csv, and df3_en.csv contain annotations from different annotators about speaker prompts.
- Exclusion Lists:
  - Certain audio files are advised to be excluded due to inconsistent speaker gender (excluded_spk_list.txt) or failed speech restoration, making them unsuitable for annotation (unannotated_spk_list.txt).
- Style Prompt Candidates:
  - The file style_prompt_candidates_v230922.csv outlines different style prompts in the form of keys, such as gender, pitch, speaking speed, and loudness. For instance, a key like "M_p-low_s-slow_e-low" describes a male voice with low pitch, slow speed, and low volume.
- Metadata File:
  - The metadata_w_style_prompt_tags_v230922.csv file provides detailed metadata for each audio entry in the dataset, including the speaker ID, gender, and other speaking style elements. This file allows users to easily associate style prompts with specific audio files.
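As a rough illustration of how the style keys and exclusion lists described above might be handled in practice, the following Python sketch splits a key such as "M_p-low_s-slow_e-low" into named fields and filters out speakers that appear in either exclusion list. Several things here are assumptions rather than documented behavior: the mapping of the prefixes p, s, and e to pitch, speed, and loudness is inferred from the single example key, the exclusion lists are assumed to hold one speaker ID per line, and the function names (parse_style_key, read_speaker_list, keep_usable) are hypothetical helpers, not part of any LibriTTS-P tooling.

```python
from typing import Dict, Iterable, List, Set

# Assumed meaning of the key prefixes, inferred from the example key
# "M_p-low_s-slow_e-low"; the real files may use additional components.
_FIELD_NAMES = {"p": "pitch", "s": "speed", "e": "loudness"}


def parse_style_key(key: str) -> Dict[str, str]:
    """Split a style key into gender plus named style fields."""
    gender, *parts = key.split("_")
    fields = {"gender": gender}
    for part in parts:
        # Each component is expected to look like "<prefix>-<level>".
        prefix, _, level = part.partition("-")
        if prefix not in _FIELD_NAMES or not level:
            raise ValueError(f"unexpected key component: {part!r}")
        fields[_FIELD_NAMES[prefix]] = level
    return fields


def read_speaker_list(lines: Iterable[str]) -> Set[str]:
    """Collect non-empty speaker IDs, assuming one ID per line."""
    return {line.strip() for line in lines if line.strip()}


def keep_usable(speaker_ids: Iterable[str],
                excluded: Set[str],
                unannotated: Set[str]) -> List[str]:
    """Drop speakers named in either exclusion list, preserving order."""
    skip = excluded | unannotated
    return [spk for spk in speaker_ids if spk not in skip]
```

To use it on the real files, one could pass an open file handle for excluded_spk_list.txt or unannotated_spk_list.txt to read_speaker_list, since iterating a text file yields its lines. Raising an error on unknown key components is a deliberate choice in this sketch, so that a mismatch with the actual key format surfaces immediately instead of being silently ignored.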
Access and Usage
LibriTTS-P is licensed under CC BY 4.0, which permits broad use and adaptation as long as appropriate attribution is given. Researchers and developers working on TTS and related fields can use the dataset to refine their models and improve the naturalness of synthetic speech. The dataset is freely accessible as part of the LibriTTS project, with detailed papers and demos available for further study.
Conclusion
LibriTTS-P is a significant step forward for TTS and style captioning research, offering detailed prompts that reflect natural human perception. Its hybrid annotation approach and diverse prompts make it a valuable resource for advancing speech synthesis and for building more natural spoken interactions.