Introduction to the Speech Dataset Generator Project
The Speech Dataset Generator project, developed by David Martin Rius, is designed to simplify the creation of datasets for training text-to-speech and speech-to-text models. It transcribes audio files, improves their quality, and compiles the results into well-structured datasets for research and development.
Key Functionalities
- Dataset Generation: Creates multilingual datasets and evaluates them with a Mean Opinion Score (MOS) to ensure quality.
- Silence Removal: Removes unnecessary silences from audio files, improving the clarity and usability of the audio data.
- Audio Quality Enhancement: Improves audio quality when needed so that the resulting dataset is more usable.
- Audio Segmentation: Cuts audio files into smaller segments based on specified time ranges.
- Transcription: Transcribes the segmented audio files, providing a textual representation for analysis and processing.
- Gender Identification: Identifies the gender of each speaker in the audio.
- Pyannote Embeddings: Uses pyannote embeddings to detect speakers across multiple audio files.
- Automatic Speaker Naming: Assigns names to detected speakers automatically for easier identification in datasets.
- Multiple Speaker Detection: Identifies multiple speakers within a single audio file.
- Speaker Embedding Storage: Stores detected speaker embeddings in a Chroma database, reducing the need to assign speaker names manually.
- Speech Metrics: Provides metrics such as syllable counts and words per minute for detailed analysis.
- Compatibility with Multiple Input Sources: Audio files can be uploaded directly or downloaded from platforms such as YouTube, LibriVox, and TED Talks.
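The speech metrics mentioned above can be sketched with a simple heuristic. The snippet below is a minimal illustration, not the project's actual implementation: it estimates syllables by counting vowel groups and derives words per minute from the transcript length and the segment duration.

```python
import re

def estimate_syllables(word: str) -> int:
    # Rough heuristic: each run of consecutive vowels counts as one syllable,
    # with a floor of one syllable per word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def speech_metrics(transcript: str, duration_seconds: float) -> dict:
    # Compute word count, estimated syllable count, and words per minute
    # for one transcribed audio segment.
    words = transcript.split()
    minutes = duration_seconds / 60
    return {
        "words": len(words),
        "syllables": sum(estimate_syllables(w) for w in words),
        "words_per_minute": len(words) / minutes,
    }
```

For example, a six-word transcript spoken over twelve seconds yields a rate of 30 words per minute.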
Output Examples
The project generates outputs in organized folders, which hold data like audio files, transcriptions, and metadata, ensuring that datasets are well-structured and easy to access.
Installation
The project has been successfully tested on Ubuntu 22. Users are expected to set up a virtual environment and install the necessary dependencies from the requirements.txt file. Users must also accept the usage agreements for gated models, such as the pyannote embedding model on Hugging Face, before running the tool.
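A typical setup might look like the following. This is a sketch assuming a standard Python project layout; the Hugging Face token variable name is an assumption, so consult the project's README for the exact steps.

```shell
# From the project root: create and activate a virtual environment.
python3 -m venv venv
source venv/bin/activate

# Install the dependencies listed by the project.
pip install -r requirements.txt

# The pyannote embedding model is gated: accept its terms on Hugging Face,
# then export an access token (variable name is illustrative).
export HF_TOKEN=your_hugging_face_token
```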
Usage
The main script is speech_dataset_generator/main.py, which accepts various command-line arguments. It can be used to process individual audio files or entire folders, and it even supports downloading and processing videos from YouTube, LibriVox, and TED Talks. Several enhancers can be applied to improve audio quality during the processing phase.
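A run might look like the sketch below. The flag names here are illustrative assumptions rather than confirmed options; run the script with --help to see its actual interface.

```shell
# Process a single audio file, keeping a given time range and applying an
# enhancer during processing (flag names are assumptions, not verified).
python speech_dataset_generator/main.py \
    --input_file_path example.mp3 \
    --output_directory ./outputs \
    --range_times 4-10 \
    --enhancers deepfilternet
```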
Future Developments
The project is planned to expand further with features like Docker image creation, emotion recognition, enhanced age and gender classification, and increased compatibility with multiple dataset types.
Conclusion
The Speech Dataset Generator is a comprehensive tool aimed at researchers and developers working in speech technology. It offers robust features to enhance, structure, and transcribe audio data effectively, making it a valuable asset for creating high-quality speech datasets.