Amphion - Comprehensive Audio, Music, and Speech Toolkit with Model Visualization

Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit

Amphion is a comprehensive toolkit designed to support research and development in the fields of audio, music, and speech generation. It aims to make progress in this domain more accessible, particularly for junior researchers and engineers. Amphion's standout feature is its ability to visually demonstrate the inner workings of various audio models and architectures, facilitating a deeper understanding of these complex systems.

Project Aims

The central goal of Amphion is to serve as a platform for converting various inputs into audio. This covers the following generation tasks:

Text to Speech (TTS): Turning written text into spoken words (supported).
Singing Voice Synthesis (SVS): Generating a sung voice from text input (in development).
Voice Conversion (VC): Changing the speaker's voice while retaining the content of the speech (in development).
Singing Voice Conversion (SVC): Converting one singing voice to sound like another (supported).
Text to Audio (TTA): Generating non-speech audio from text (supported).
Text to Music (TTM): Creating music from text descriptions (in development).

Beyond these tasks, Amphion includes several vocoders, which turn intermediate acoustic features into audio waveforms, and a set of evaluation metrics for assessing the quality of generated audio consistently across tasks.

Recent Developments

2024/10/19: Launch of MaskGCT, a TTS model that removes the need for explicit alignment information between text and speech.
Autumn 2024: Public release of the Emilia dataset, a large and diverse speech collection with 101,000 hours of data.
2024/07/01: Release of the Emilia preprocessing pipeline, which turns raw speech recordings into data suitable for model training.
2024/06/17: Introduction of an upgraded VALL-E model with improved performance.
2024/03/12: Support for NaturalSpeech3 FACodec with released pretrained checkpoints.

Key Features

Text to Speech (TTS)

Amphion implements several state-of-the-art text-to-speech models, including FastSpeech2, VITS, and the more recent MaskGCT. Together they span different techniques, from non-autoregressive generation (FastSpeech2) to end-to-end training with adversarial learning (VITS).

Singing Voice Conversion (SVC)

Amphion supports content features extracted by multiple pretrained models and implements diffusion-, transformer-, and flow-based architectures for singing voice conversion.

Text to Audio (TTA)

Amphion's text-to-audio support centers on latent diffusion models, following designs similar to AudioLDM and Make-An-Audio.
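For readers new to diffusion, the following is a minimal sketch of the forward (noising) step that diffusion models are trained to reverse. It is a generic DDPM-style illustration, not Amphion's code: the function names, the linear beta schedule, and the default values are assumptions made for this example, and the "latent" is a toy list of floats standing in for the compressed audio representation that an autoencoder would produce in a latent diffusion model.

```python
import math
import random
from typing import List


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Per-step noise variances that grow linearly (assumed schedule)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta): the share of signal left at each step."""
    alpha_bars = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(latent: List[float], alpha_bar: float,
              rng: random.Random) -> List[float]:
    """Sample z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * z + noise_scale * rng.gauss(0.0, 1.0) for z in latent]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    toy_latent = [0.5, -1.0, 0.25]  # hypothetical latent vector
    # Early steps keep most of the signal; late steps are close to pure noise.
    print(add_noise(toy_latent, alpha_bars[10], rng))
    print(add_noise(toy_latent, alpha_bars[-1], rng))
```

During training, a network learns to predict the added noise from the noised latent, conditioned on the text prompt. Generation then runs the process in reverse, starting from random noise.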

Vocoder

The toolkit supports a range of neural vocoders, including GAN-based, flow-based, diffusion-based, and autoregressive models.

Evaluation Metrics

Amphion offers a wide array of evaluation metrics to objectively assess generated audio in terms of pitch accuracy, energy modeling, intelligibility, and speaker similarity.
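To make two of these metric families concrete, here is a minimal sketch of a pitch-accuracy measure (F0 root-mean-square error) and a speaker-similarity measure (cosine similarity between speaker embeddings). It is an illustration only and does not reproduce Amphion's metric implementations: the function names are hypothetical, the F0 contours and embeddings are assumed to have been extracted already by some other tool, and marking unvoiced frames with 0.0 is an assumed convention.

```python
import math
from typing import Optional, Sequence


def f0_rmse(ref_f0: Sequence[float], gen_f0: Sequence[float]) -> Optional[float]:
    """Root-mean-square F0 error in Hz over frames voiced in both contours.

    Frames where either contour is 0.0 (assumed to mean unvoiced) are skipped.
    """
    voiced = [(r, g) for r, g in zip(ref_f0, gen_f0) if r > 0 and g > 0]
    if not voiced:
        return None  # no overlapping voiced frames, so the error is undefined
    return math.sqrt(sum((r - g) ** 2 for r, g in voiced) / len(voiced))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy F0 contours in Hz; 0.0 marks an unvoiced frame.
    reference = [220.0, 222.0, 0.0, 230.0]
    generated = [218.0, 225.0, 110.0, 230.0]
    print(f0_rmse(reference, generated))
    # Toy speaker embeddings; real ones come from a speaker verification model.
    print(cosine_similarity([1.0, 0.0, 1.0], [0.9, 0.1, 1.1]))
```

A lower F0 error means the generated pitch follows the reference more closely, and a cosine similarity near 1 means the two embeddings point in nearly the same direction, which is commonly read as the same speaker.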

Datasets

Amphion provides unified preprocessing for a variety of established open-source audio datasets. It also offers exclusive support for the Emilia dataset for speech generation.

Visualization Tools

Amphion includes visualization tools such as SingVisio, which visualizes the diffusion model used for singing voice conversion. Such tools give insight into a model's internal process and support both teaching and research on audio generation workflows.

Installation and Usage

Amphion can be installed either through its setup installer or from a Docker image. Once installed, it can be used from a Python environment for tasks such as vocoding, evaluation, and visualization.

Contributing to Amphion

Amphion invites contributions from the community to enhance its toolkit, reflecting its open-source ethos and commitment to collective advancement in audio technologies.

Overall, Amphion stands as a key resource for anyone interested in the field of audio generation, providing robust tools and community support to foster innovation and learning.