Overview of `stable-audio-metrics`
`stable-audio-metrics` is a comprehensive suite designed to assess the performance of music and audio generative models. It uses several well-known metrics tailored for this purpose:
- Fréchet Distance at 48kHz, leveraging the Openl3 framework.
- Kullback–Leibler Divergence at 32kHz, based on PaSST.
- CLAP Score at 48kHz, which uses the CLAP-LAION model.
Each of these metrics has been adapted for realistic scenarios where long-form, full-band stereo audio generations are evaluated. What sets `stable-audio-metrics` apart is its flexibility in handling audio inputs of varying lengths.
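For intuition, the Fréchet distance fits a Gaussian to the reference embeddings and another to the generated ones, then measures the closed-form distance between the two fits. Below is a minimal numpy/scipy sketch of that computation; it is not the repository's API, and all names are illustrative:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two embedding sets,
    each of shape (num_examples, embedding_dim)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # the matrix square root can pick up a small imaginary component
    # from numerical error; keep only the real part
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

In `stable-audio-metrics` the embeddings come from Openl3 at 48kHz; the sketch only shows the distance between the fitted Gaussians.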
Installation Guide
To get started with `stable-audio-metrics`, one needs to clone the repository and set up a virtual environment. Here’s a quick step-by-step guide:
- Clone the Repository: Begin by cloning the project repository.
- Create and Activate Virtual Environment: Use `python3 -m venv env` to create a virtual environment, then activate it with `source env/bin/activate`.
- Install Dependencies: Run `pip install -r requirements.txt` to install the necessary dependencies.
Notes:
- GPU Support: The package is designed to run on a GPU; processing can be very slow on a CPU.
- Troubleshooting: Users might need an older version of CUDA, such as CUDA 11.8, for compatibility with Openl3 dependencies.
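If a CUDA mismatch is suspected, a quick sanity check is to print the CUDA build that the environment actually sees (assuming PyTorch is among the installed dependencies, as is typical for PaSST and CLAP):

```python
import torch

print(torch.version.cuda)         # CUDA version this PyTorch build targets
print(torch.cuda.is_available())  # True if a usable GPU is detected
```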
Detailed Documentation
`stable-audio-metrics` offers robust documentation for its main features:
- Fréchet Distance, detailed in the script `src/openl3_fd.py`
- Kullback–Leibler Divergence, documented in `src/passt_kld.py`
- CLAP-LAION Score, detailed in `src/clap_score.py`
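For intuition, the quantity at the heart of `src/passt_kld.py` is the standard Kullback–Leibler divergence between class-probability vectors, such as those the PaSST tagger produces for paired reference and generated clips. A minimal sketch of the formula alone (model inference, segmentation, and averaging are handled by the script; names are illustrative):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-6) -> float:
    """KL(p || q) between two class-probability vectors, e.g. tag
    probabilities for a reference clip and for a generated clip."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))
```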
Additionally, example scripts are provided for practical guidance on using these metrics:
- Example using Fréchet Distance: `examples/musiccaps_openl3_fd.py`
- Example with Kullback–Leibler Divergence: `examples/musiccaps_passt_kld.py`
- Example for CLAP-LAION Score: `examples/musiccaps_clap_score.py`
The documentation also includes examples for evaluating datasets such as MusicCaps, AudioCaps, and Song Describer.
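As background, a CLAP score is commonly computed as the average cosine similarity between each text prompt's embedding and the embedding of the corresponding generated audio. A minimal sketch over hypothetical, pre-extracted embedding arrays (the CLAP-LAION embedding extraction itself is what `src/clap_score.py` takes care of):

```python
import numpy as np

def clap_style_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Average cosine similarity between paired text and audio embeddings,
    each of shape (num_examples, embedding_dim)."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(text_emb * audio_emb, axis=1)))
```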
How to Use
Users can adjust the provided examples to point at the folder containing their audio generations. For instance, one can run `CUDA_VISIBLE_DEVICES=6 python examples/audiocaps_no-audio.py` to conduct an evaluation with AudioCaps.
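The same device pinning can also be done from inside a Python script, as long as the variable is set before any CUDA-backed framework is imported; a minimal sketch:

```python
import os

# Must run before importing torch/tensorflow: CUDA device visibility
# is fixed when the framework first initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
```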
Special Features:
- Metrics without Datasets: The `no-audio` examples enable evaluations without needing to download datasets, as reference statistics and embeddings are pre-computed.
- Comparing with Stable Audio: To compare against Stable Audio, ensure all parameters match those in the `no-audio` examples. The system will manage resampling and mono/stereo handling for an accurate comparison.
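Purely as an illustration of what that handling entails (the repository does this internally, so none of it is required of the user; the file name is hypothetical), resampling to 48kHz stereo with torchaudio might look like:

```python
import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("my_generation.wav")  # hypothetical file
if sr != 48000:
    # resample to the 48kHz rate used by the Openl3 and CLAP metrics
    wav = F.resample(wav, orig_freq=sr, new_freq=48000)
if wav.shape[0] == 1:
    wav = wav.repeat(2, 1)  # duplicate the mono channel to obtain stereo
```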
Data Structure Guidelines
For each dataset, generate audio for every prompt and name each file according to its corresponding ID. Here's a basic outline for some datasets:
- MusicCaps: Create audio files named after the `ytid` from `load/musiccaps-public.csv`.
- AudioCaps: Name files based on the `audiocap_id` from `load/audiocaps-test.csv`.
This structure can be extended to other datasets, as demonstrated with the Song Describer dataset. Further details can be found in the examples' documentation.
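As a sanity check before launching an evaluation, one can verify that every prompt has a correspondingly named generation. A hedged sketch for MusicCaps, assuming a hypothetical `my_generations/` folder and a `.wav` extension:

```python
import csv
from pathlib import Path

generations = Path("my_generations")  # hypothetical output folder

# List every MusicCaps prompt whose generation file is missing.
with open("load/musiccaps-public.csv", newline="") as f:
    missing = [row["ytid"] for row in csv.DictReader(f)
               if not (generations / f"{row['ytid']}.wav").exists()]

print(f"{len(missing)} prompts without a generated file")
```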