Overview of `stable-audio-metrics`
`stable-audio-metrics` is a comprehensive suite designed to assess the performance of music and audio generative models. It uses several well-known metrics tailored for this purpose:
- Fréchet Distance at 48kHz, leveraging the Openl3 framework.
- Kullback–Leibler Divergence at 32kHz, based on PaSST.
- CLAP Score at 48kHz, which uses the CLAP-LAION model.
Each of these metrics has been adapted for realistic scenarios where long-form, full-band stereo audio generations are evaluated. What sets `stable-audio-metrics` apart is its flexibility in handling audio inputs of varying lengths.
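For intuition, the Fréchet distance fits a Gaussian to the reference embeddings and another to the generated ones, then measures the closed-form distance between the two fits. Below is a minimal numpy/scipy sketch of that computation; it is not the repository's API, and all names are illustrative:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two embedding sets,
    each of shape (num_examples, embedding_dim)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # the matrix square root can pick up a small imaginary component
    # from numerical error; keep only the real part
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

In `stable-audio-metrics` the embeddings come from Openl3 at 48kHz; the sketch only shows the distance between the fitted Gaussians.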
Installation Guide
To get started with `stable-audio-metrics`, one needs to clone the repository and set up a virtual environment. Here’s a quick step-by-step guide:
- Clone the Repository: Begin by cloning the project repository.
- Create and Activate Virtual Environment: Use `python3 -m venv env` to create a virtual environment, then activate it with `source env/bin/activate`.
- Install Dependencies: Run `pip install -r requirements.txt` to install the necessary dependencies.
Notes:
- GPU Support: The package is designed to run on a GPU; processing can be very slow on a CPU.
- Troubleshooting: Users might need an older version of CUDA, such as CUDA 11.8, for compatibility with Openl3 dependencies.
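If a CUDA mismatch is suspected, a quick sanity check is to print the CUDA build that the environment actually sees (assuming PyTorch is among the installed dependencies, as is typical for PaSST and CLAP):

```python
import torch

print(torch.version.cuda)         # CUDA version this PyTorch build targets
print(torch.cuda.is_available())  # True if a usable GPU is detected
```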
Detailed Documentation
`stable-audio-metrics` offers robust documentation for its main features:
- Fréchet Distance, detailed in the script `src/openl3_fd.py`
- Kullback–Leibler Divergence, documented in `src/passt_kld.py`
- CLAP-LAION Score, detailed in `src/clap_score.py`
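For intuition, the quantity at the heart of `src/passt_kld.py` is the standard Kullback–Leibler divergence between class-probability vectors, such as those the PaSST tagger produces for paired reference and generated clips. A minimal sketch of the formula alone (model inference, segmentation, and averaging are handled by the script; names are illustrative):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-6) -> float:
    """KL(p || q) between two class-probability vectors, e.g. tag
    probabilities for a reference clip and for a generated clip."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))
```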
Additionally, example scripts are provided for practical guidance on using these metrics:
- Example using Fréchet Distance: `examples/musiccaps_openl3_fd.py`
- Example with Kullback–Leibler Divergence: `examples/musiccaps_passt_kld.py`
- Example for CLAP-LAION Score: `examples/musiccaps_clap_score.py`
The documentation also includes examples for evaluating datasets such as MusicCaps, AudioCaps, and Song Describer.
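As background, a CLAP score is commonly computed as the average cosine similarity between each text prompt's embedding and the embedding of the corresponding generated audio. A minimal sketch over hypothetical, pre-extracted embedding arrays (the CLAP-LAION embedding extraction itself is what `src/clap_score.py` takes care of):

```python
import numpy as np

def clap_style_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Average cosine similarity between paired text and audio embeddings,
    each of shape (num_examples, embedding_dim)."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(text_emb * audio_emb, axis=1)))
```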
How to Use
Users can adjust the provided examples to point at the folder containing their audio generations. For instance, one can run `CUDA_VISIBLE_DEVICES=6 python examples/audiocaps_no-audio.py` to conduct an evaluation with AudioCaps.
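The same device pinning can also be done from inside a Python script, as long as the variable is set before any CUDA-backed framework is imported; a minimal sketch:

```python
import os

# Must run before importing torch/tensorflow: CUDA device visibility
# is fixed when the framework first initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
```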
Special Features:
- Metrics without Datasets: The `no-audio` examples enable evaluations without needing to download datasets, as reference statistics and embeddings are pre-computed.
- Comparing with Stable Audio: To compare against Stable Audio, ensure all parameters match those in the `no-audio` examples. The system will manage resampling and mono/stereo handling for an accurate comparison.
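Purely as an illustration of what that handling entails (the repository does this internally, so none of it is required of the user; the file name is hypothetical), resampling to 48kHz stereo with torchaudio might look like:

```python
import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("my_generation.wav")  # hypothetical file
if sr != 48000:
    # resample to the 48kHz rate used by the Openl3 and CLAP metrics
    wav = F.resample(wav, orig_freq=sr, new_freq=48000)
if wav.shape[0] == 1:
    wav = wav.repeat(2, 1)  # duplicate the mono channel to obtain stereo
```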
Data Structure Guidelines
For each dataset, generate audio for every prompt and name each file according to its corresponding ID. Here's a basic outline for some datasets:
- MusicCaps: Create audio files named after the `ytid` from `load/musiccaps-public.csv`.
- AudioCaps: Name files based on the `audiocap_id` from `load/audiocaps-test.csv`.
This structure can be extended to other datasets, as demonstrated with the Song Describer dataset. Further details can be found in the examples' documentation.
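As a sanity check before launching an evaluation, one can verify that every prompt has a correspondingly named generation. A hedged sketch for MusicCaps, assuming a hypothetical `my_generations/` folder and a `.wav` extension:

```python
import csv
from pathlib import Path

generations = Path("my_generations")  # hypothetical output folder

# List every MusicCaps prompt whose generation file is missing.
with open("load/musiccaps-public.csv", newline="") as f:
    missing = [row["ytid"] for row in csv.DictReader(f)
               if not (generations / f"{row['ytid']}.wav").exists()]

print(f"{len(missing)} prompts without a generated file")
```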