Introduction to MelNet
MelNet is a generative model for audio in the frequency domain. Rather than modeling raw waveforms directly, it models mel spectrograms, learning frequency-domain structure from training data and using those learned patterns to generate new, realistic audio.
Prerequisites
To run MelNet, the following are required:
- Python 3.6.8 or 3.7.4.
- PyTorch 1.2.0 or 1.3.0.
- Required packages can be installed with:
pip install -r requirements.txt
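As a quick, optional sanity check before installing the requirements, the interpreter and PyTorch versions can be printed from Python. This is only an illustrative snippet, not part of the repository:
import sys
import torch

# Expected values per the prerequisites above.
print("Python:", sys.version.split()[0])        # 3.6.8 or 3.7.4
print("PyTorch:", torch.__version__)            # 1.2.0 or 1.3.0
print("CUDA available:", torch.cuda.is_available())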
Training MelNet
Datasets
MelNet supports several datasets, including Blizzard, VoxCeleb2, and KSS, each with a pre-configured YAML file in the config/ directory. For other datasets, a custom YAML file can be created from the existing templates.
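When preparing a custom dataset, it can help to confirm that the audio files on disk match the file extension declared in the YAML configuration. The snippet below is only an illustrative sketch; the dataset path and extension are placeholder assumptions, not settings taken from the repository:
from pathlib import Path

# Placeholder values; use the dataset location and extension from your YAML file.
dataset_root = Path("datasets/my_dataset")
extension = ".wav"

files = sorted(dataset_root.rglob("*" + extension))
print(f"Found {len(files)} {extension} files under {dataset_root}")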
- Unconditional Training: available for any dataset whose audio file extension is specified in the YAML configuration.
- Conditional Training: currently supported only for the KSS dataset and part of the Blizzard dataset.
Execution
To train the model, the following command is used:
python trainer.py -c [config YAML file path] -n [name of run] -t [tier number] -b [batch size] -s [TTS]
- The model is trained in stages called tiers, one tier at a time. Successive tiers are generally more demanding than the one before (the first tier being an exception), so the batch size should be adjusted for each tier; example commands follow this list.
- Note that memory-intensive tiers such as tier 6 may not fit on a typical 16 GB GPU, even at the smallest batch size.
- The -s flag enables Text-to-Speech (TTS) training and is relevant only for the first tier; for any tier beyond the first, the value passed to -s is ignored.
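As a usage example, the commands below train the first two tiers of a run, lowering the batch size for the more demanding tier. The config file name, run name, and batch sizes are placeholders for illustration, not values taken from the repository:
python trainer.py -c config/blizzard.yaml -n my_run -t 1 -b 16
python trainer.py -c config/blizzard.yaml -n my_run -t 2 -b 8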
Sampling with MelNet
Preparing Checkpoints
- Checkpoints used for sampling must be placed in the chkpt/ directory.
- An inference.yaml file must be placed under config/. It specifies the number of tiers, the checkpoint names, and whether generation is conditional or unconditional.
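Before sampling, it can be useful to verify that the checkpoints referenced by inference.yaml are actually present in chkpt/. The sketch below assumes a hypothetical key name ("checkpoints") for the list of checkpoint files; the real schema is defined by the repository's inference.yaml template:
from pathlib import Path
import yaml  # PyYAML

cfg = yaml.safe_load(open("config/inference.yaml"))
for name in cfg.get("checkpoints", []):  # "checkpoints" is a hypothetical key
    path = Path("chkpt") / name
    print(path, "found" if path.exists() else "MISSING")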
Sampling Execution
The sampling process is initiated with:
python inference.py -c [config YAML file path] -p [inference YAML file path] -t [timestep of generated mel spectrogram] -n [name of sample] -i [input sentence for conditional generation]
- Timestep: sets the length of the generated mel spectrogram. The number of timesteps per second of audio is roughly [sample rate] : [hop length of FFT]; see the worked example after this list.
- Conditional generation: the -i flag supplies the input sentence, which must be enclosed in quotes and end with a period. It is not needed for unconditional generation.
- Primed generation (continuing from an initial audio input) is not currently supported for either mode.
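To make the timestep relationship concrete, the short calculation below assumes a 22050 Hz sample rate and a 256-sample hop length; these are common defaults used only for illustration, and the actual values come from the dataset's YAML configuration:
# Assumed example values; read the real ones from the YAML config.
sample_rate = 22050   # Hz
hop_length = 256      # samples per FFT hop
frames_per_second = sample_rate / hop_length   # ~86 mel frames per second
seconds_of_audio = 5
timesteps = round(seconds_of_audio * frames_per_second)
print(timesteps)  # ~431, passed to the -t flag
A hypothetical conditional sampling run (config names, timestep, sample name, and sentence are all placeholders) could then look like:
python inference.py -c config/blizzard.yaml -p config/inference.yaml -t 431 -n sample_01 -i "This is a test sentence."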
Developmental Milestones
MelNet's development has reached the following milestones:
- Upsampling, GMM sampling, TTS synthesis, and unconditional audio generation are functional.
- Multi-GPU training and TensorBoard logging are supported.
- Primed generation is planned for a future update.
Contributors
The MelNet project owes its progress to:
- Seungwon Park
- June Young Yi
- Yoonhyung Lee
- Joowhan Song
All contributors are members of Deepest Season 6.
License
MelNet is open source and released under the MIT License, which permits free use, modification, and distribution.
This overview summarizes MelNet's frequency-domain approach to audio generation and the workflow for training the model and sampling from it.