Introduction to MelNet
MelNet is a generative model for audio in the frequency domain. Rather than modeling raw waveforms directly, it models mel spectrograms, learning frequency-domain structure from training data and using those learned patterns to generate new, realistic audio.
Prerequisites
To run MelNet, the following are required:
- Python 3.6.8 or 3.7.4.
- PyTorch 1.2.0 or 1.3.0.
- Required packages can be installed with:
pip install -r requirements.txt
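As a quick, optional sanity check before installing the requirements, the interpreter and PyTorch versions can be printed from Python. This is only an illustrative snippet, not part of the repository:
import sys
import torch

# Expected values per the prerequisites above.
print("Python:", sys.version.split()[0])        # 3.6.8 or 3.7.4
print("PyTorch:", torch.__version__)            # 1.2.0 or 1.3.0
print("CUDA available:", torch.cuda.is_available())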
Training MelNet
Datasets
MelNet supports several datasets, including Blizzard, VoxCeleb2, and KSS, each with a pre-configured YAML file in the config/ directory. For other datasets, a custom YAML file can be created from the existing templates.
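When preparing a custom dataset, it can help to confirm that the audio files on disk match the file extension declared in the YAML configuration. The snippet below is only an illustrative sketch; the dataset path and extension are placeholder assumptions, not settings taken from the repository:
from pathlib import Path

# Placeholder values; use the dataset location and extension from your YAML file.
dataset_root = Path("datasets/my_dataset")
extension = ".wav"

files = sorted(dataset_root.rglob("*" + extension))
print(f"Found {len(files)} {extension} files under {dataset_root}")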
- Unconditional Training: available for any dataset whose audio file extension is specified in the YAML configuration.
- Conditional Training: currently supported only for the KSS dataset and part of the Blizzard dataset.
Execution
To train the model, the following command is used:
python trainer.py -c [config YAML file path] -n [name of run] -t [tier number] -b [batch size] -s [TTS]
- The model is trained in stages called tiers, one tier at a time. Successive tiers are generally more demanding than the one before (the first tier being an exception), so the batch size should be adjusted for each tier; example commands follow this list.
- Note that memory-intensive tiers such as tier 6 may not fit on a typical 16 GB GPU, even at the smallest batch size.
- The -s flag enables Text-to-Speech (TTS) training and is relevant only for the first tier; for any tier beyond the first, the value passed to -s is ignored.
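As a usage example, the commands below train the first two tiers of a run, lowering the batch size for the more demanding tier. The config file name, run name, and batch sizes are placeholders for illustration, not values taken from the repository:
python trainer.py -c config/blizzard.yaml -n my_run -t 1 -b 16
python trainer.py -c config/blizzard.yaml -n my_run -t 2 -b 8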
Sampling with MelNet
Preparing Checkpoints
- Checkpoints used for sampling must be placed in the chkpt/ directory.
- An inference.yaml file must be placed under config/. It specifies the number of tiers, the checkpoint names, and whether generation is conditional or unconditional.
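Before sampling, it can be useful to verify that the checkpoints referenced by inference.yaml are actually present in chkpt/. The sketch below assumes a hypothetical key name ("checkpoints") for the list of checkpoint files; the real schema is defined by the repository's inference.yaml template:
from pathlib import Path
import yaml  # PyYAML

cfg = yaml.safe_load(open("config/inference.yaml"))
for name in cfg.get("checkpoints", []):  # "checkpoints" is a hypothetical key
    path = Path("chkpt") / name
    print(path, "found" if path.exists() else "MISSING")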
Sampling Execution
The sampling process is initiated with:
python inference.py -c [config YAML file path] -p [inference YAML file path] -t [timestep of generated mel spectrogram] -n [name of sample] -i [input sentence for conditional generation]
- Timestep: sets the length of the generated mel spectrogram. The number of timesteps per second of audio is roughly [sample rate] : [hop length of FFT]; see the worked example after this list.
- Conditional generation: the -i flag supplies the input sentence, which must be enclosed in quotes and end with a period. It is not needed for unconditional generation.
- Primed generation (continuing from an initial audio input) is not currently supported for either mode.
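To make the timestep relationship concrete, the short calculation below assumes a 22050 Hz sample rate and a 256-sample hop length; these are common defaults used only for illustration, and the actual values come from the dataset's YAML configuration:
# Assumed example values; read the real ones from the YAML config.
sample_rate = 22050   # Hz
hop_length = 256      # samples per FFT hop
frames_per_second = sample_rate / hop_length   # ~86 mel frames per second
seconds_of_audio = 5
timesteps = round(seconds_of_audio * frames_per_second)
print(timesteps)  # ~431, passed to the -t flag
A hypothetical conditional sampling run (config names, timestep, sample name, and sentence are all placeholders) could then look like:
python inference.py -c config/blizzard.yaml -p config/inference.yaml -t 431 -n sample_01 -i "This is a test sentence."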
Developmental Milestones
MelNet's development has reached the following milestones:
- Upsampling, GMM sampling, TTS synthesis, and unconditional audio generation are functional.
- Multi-GPU training and TensorBoard logging are supported.
- Primed generation is planned for a future update.
Contributors
The MelNet project owes its progress to:
- Seungwon Park
- June Young Yi
- Yoonhyung Lee
- Joowhan Song
All contributors are members of Deepest Season 6.
License
MelNet is open source and released under the MIT License, which permits free use, modification, and distribution.
This overview summarizes MelNet's frequency-domain approach to audio generation and the workflow for training the model and sampling from it.