EmoV-DB: An Introduction to the Emotional Voices Database
The Emotional Voices Database (EmoV-DB) is a resource designed to advance the field of emotional speech synthesis. Built on sentences from the CMU ARCTIC database, EmoV-DB adds emotional expressiveness to data intended for voice generation systems. The database offers a rich assortment of recordings in different emotional styles, contributed by four distinct speakers.
Downloading EmoV-DB
The EmoV-DB dataset can be downloaded in two forms: sorted and unsorted. The recommended, sorted version is hosted on OpenSLR; an older, slower link on Mega retains the folder structure required by the "load_emov_db()" function. The unsorted version is available from the Northeastern University Research page.
Utilizing Forced Alignments
Forced alignment is an essential step when working with the dataset. It matches the text transcription of an utterance to the corresponding time intervals in the audio. This not only yields word timings but also makes it possible to separate verbal content from non-verbal vocalizations, such as laughter or yawns, that may precede or follow a sentence.
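To make this concrete, here is a minimal sketch of trimming a recording down to its verbal content. It assumes word-level alignments are already available as (word, start, end) tuples from an aligner such as MFA or Gentle, and that the soundfile package is installed; the file names and alignment values are hypothetical.

```python
# Minimal sketch: trim non-verbal audio (e.g. leading laughter) using
# word-level forced alignments given as (word, start_sec, end_sec) tuples.
import soundfile as sf  # assumed audio I/O library, not a dataset requirement

def trim_to_speech(wav_path, alignments, out_path, padding=0.05):
    """Keep only the span from the first to the last aligned word."""
    audio, sr = sf.read(wav_path)
    start = max(alignments[0][1] - padding, 0.0)       # first word onset
    end = min(alignments[-1][2] + padding, len(audio) / sr)  # last word offset
    sf.write(out_path, audio[int(start * sr):int(end * sr)], sr)

# Hypothetical alignment for an utterance preceded by laughter:
alignments = [("author", 1.32, 1.71), ("of", 1.71, 1.84), ("the", 1.84, 1.95)]
trim_to_speech("amused_1-15_0001.wav", alignments, "amused_1-15_0001_trimmed.wav")
```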
Alignments with Montreal Forced Aligner (MFA)
To use MFA, first install it following the instructions in the Montreal Forced Aligner documentation. The alignment process relies on pretrained acoustic and grapheme-to-phoneme (G2P) models. Alignments can be scripted with Python and shell commands that organize the dataset by emotional style and run the aligner over the resulting corpus.
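As one illustration of that workflow, the sketch below prepares a flat corpus in the layout MFA expects (each .wav next to a .lab transcript with the same stem) and then invokes the MFA 2.x command line via subprocess. The dataset path, the single example transcript, and the model/dictionary name english_us_arctic are assumptions; consult the MFA documentation for the identifiers that apply to your installation.

```python
# Sketch: build an MFA-style corpus from EmoV-DB and run alignment.
import shutil
import subprocess
from pathlib import Path

EMOV_ROOT = Path("EmoV-DB")   # hypothetical location of the downloaded dataset
CORPUS = Path("mfa_corpus")
# Hypothetical mapping from sentence number to its CMU ARCTIC transcript:
transcripts = {"0001": "Author of the danger trail, Philip Steels, etc."}

CORPUS.mkdir(exist_ok=True)
for wav in EMOV_ROOT.rglob("*.wav"):
    sentence_id = wav.stem.split("_")[-1]   # e.g. "amused_1-15_0001" -> "0001"
    if sentence_id in transcripts:
        shutil.copy(wav, CORPUS / wav.name)
        (CORPUS / f"{wav.stem}.lab").write_text(transcripts[sentence_id])

# Download pretrained models, then align (MFA 2.x command layout; the
# "english_us_arctic" names are assumptions, not confirmed identifiers):
subprocess.run(["mfa", "model", "download", "acoustic", "english_us_arctic"], check=True)
subprocess.run(["mfa", "model", "download", "dictionary", "english_us_arctic"], check=True)
subprocess.run(["mfa", "align", str(CORPUS), "english_us_arctic",
                "english_us_arctic", "aligned_output"], check=True)
```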
Alternative Alignment with Gentle
An older method uses a tool called Gentle, which is generally considered less convenient than MFA. Gentle requires a more involved setup but can still perform the core task of aligning speech audio with its textual transcript.
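For reference, Gentle runs as a local web service once set up. The sketch below assumes the default server on port 8765 (started with Gentle's serve.py) and the requests package; the file names are hypothetical.

```python
# Sketch: align one utterance with a locally running Gentle server.
import requests

with open("amused_1-15_0001.wav", "rb") as audio, \
     open("transcript_0001.txt", "rb") as transcript:
    resp = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio, "transcript": transcript},
    )
resp.raise_for_status()

# Gentle returns JSON with per-word start/end times in seconds.
for word in resp.json().get("words", []):
    print(word.get("word"), word.get("start"), word.get("end"))
```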
Data Overview
EmoV-DB is curated for emotional speech synthesis and covers five emotional styles: Neutral, Amused, Angry, Sleepy, and Disgusted. The dataset comprises recordings from two male and two female English speakers, each contributing several of these styles across numerous audio files in 16-bit WAV format.
Speaker and File Details:
- Spk-Je (Female, English): Neutral (417 files), Amused (222 files), Angry (523 files), Sleepy (466 files), Disgust (189 files)
- Spk-Bea (Female, English): Neutral (373 files), Amused (309 files), Angry (317 files), Sleepy (520 files), Disgust (347 files)
- Spk-Sa (Male, English): Neutral (493 files), Amused (501 files), Angry (468 files), Sleepy (495 files), Disgust (497 files)
- Spk-Jsh (Male, English): Neutral (302 files), Amused (298 files), Sleepy (263 files)
File naming in the dataset is systematic: each name encodes the emotional style, the annotated file range, and the sentence number, so users can quickly identify and access the files they need.
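As a sketch of how that convention can be exploited, the parser below assumes names of the form style_lo-hi_sentence.wav, such as anger_1-28_0011.wav; adjust the pattern if your copy of the dataset differs.

```python
# Sketch: parse an EmoV-DB file name of the assumed form
# <style>_<lo>-<hi>_<sentence>.wav, e.g. "anger_1-28_0011.wav".
import re

NAME_RE = re.compile(r"(?P<style>[a-z]+)_(?P<lo>\d+)-(?P<hi>\d+)_(?P<sent>\d+)\.wav")

def parse_name(filename):
    m = NAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "style": m.group("style"),
        "annotation_range": (int(m.group("lo")), int(m.group("hi"))),
        "sentence": int(m.group("sent")),
    }

print(parse_name("anger_1-28_0011.wav"))
# {'style': 'anger', 'annotation_range': (1, 28), 'sentence': 11}
```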
Conclusion
EmoV-DB contributes significantly to research in emotional speech synthesis and serves as a cornerstone for developing nuanced voice generation systems. By enabling research focused on specific emotional styles, it is a critical resource for advancing voice technology applications. For more detail, prospective users are encouraged to consult the published study and to cite it with the provided BibTeX reference when using the database in academic work.