Diff-HierVC - Robust Hierarchical Voice Conversion with Enhanced Pitch Control and Masked Prior

Diff-HierVC: A New Era for Voice Conversion

Diff-HierVC is a voice conversion system built to address two common weaknesses in the field. Existing voice conversion systems are capable, but they frequently suffer from inaccurate pitch and low-quality speaker adaptation. Diff-HierVC takes a hierarchical approach based on two diffusion models, which improves both pitch accuracy and speaker adaptation quality.

Understanding Diff-HierVC

The core of Diff-HierVC lies in its unique, hierarchical framework. It introduces a novel component called DiffPitch to generate an accurate fundamental frequency, or $F_0$, that matches the target voice style. This accurately generated $F_0$ is crucial, as it is subsequently fed into another component called DiffVoice. DiffVoice takes over the task of converting the speech to align with the target voice style. By focusing on these two aspects separately, Diff-HierVC ensures precision in pitch generation and style matching.
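The ordering of the two stages can be sketched in a few lines of Python. This is only an illustration of the data flow, not the authors' implementation: `convert`, `toy_pitch_model`, and `toy_voice_model` are hypothetical names, and the toy stand-ins take the place of the actual diffusion models, which are not reproduced here.

```python
from typing import Callable, List

Mel = List[List[float]]  # frames x mel bins
F0 = List[float]         # one pitch value per frame


def convert(
    source_mel: Mel,
    target_style: List[float],
    pitch_model: Callable[[Mel, List[float]], F0],
    voice_model: Callable[[Mel, F0, List[float]], Mel],
) -> Mel:
    """Hierarchical conversion: pitch first, then speech conditioned on it."""
    # Stage 1 (the role DiffPitch plays): generate F0 in the target style.
    target_f0 = pitch_model(source_mel, target_style)
    # Stage 2 (the role DiffVoice plays): convert speech using that F0.
    return voice_model(source_mel, target_f0, target_style)


# Toy stand-ins (assumptions for illustration, not the real models).
def toy_pitch_model(mel: Mel, style: List[float]) -> F0:
    # Emit one constant F0 value per frame, taken from the style vector.
    return [style[0] for _ in mel]


def toy_voice_model(mel: Mel, f0: F0, style: List[float]) -> Mel:
    # Append each frame's F0 so the dependency on stage 1 is visible.
    return [frame + [pitch] for frame, pitch in zip(mel, f0)]


source = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
converted = convert(source, [220.0], toy_pitch_model, toy_voice_model)
```

Passing the two models in as callables keeps the sketch honest about what it shows: only that the second stage consumes the first stage's output.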

Innovative Features

Source-Filter Encoder: Diff-HierVC utilizes a source-filter encoder to disentangle various elements of the speech signal. The converted Mel-spectrogram is used as a data-driven prior within DiffVoice, enhancing the system's ability to transfer the desired voice style effectively.
Masked Prior: Applying a masked prior in the diffusion models improves speaker adaptation quality, so the converted speech sounds closer to the target speaker.
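To make the idea of masking a prior concrete, here is a minimal sketch. The text above does not say how the mask is drawn, so everything specific here is an assumption: the function name `mask_prior`, the contiguous time-span masking, the zero fill value, and the `mask_ratio` parameter are illustrative choices that may differ from what the authors actually do.

```python
import random
from typing import List, Optional


def mask_prior(
    prior: List[List[float]],
    mask_ratio: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return a copy of `prior` with one contiguous span of frames zeroed.

    Assumed scheme: span masking along the time axis with a fixed ratio.
    """
    rng = rng or random.Random()
    n_frames = len(prior)
    span = int(n_frames * mask_ratio)
    if span == 0:
        # Nothing to mask; return an untouched copy.
        return [frame[:] for frame in prior]
    # Pick the first masked frame so the span fits inside the sequence.
    start = rng.randint(0, n_frames - span)
    return [
        [0.0] * len(frame) if start <= i < start + span else frame[:]
        for i, frame in enumerate(prior)
    ]


prior = [[1.0, 1.0] for _ in range(10)]
masked = mask_prior(prior, mask_ratio=0.3, rng=random.Random(0))
n_masked = sum(1 for frame in masked if all(v == 0.0 for v in frame))
```

With 10 frames and a ratio of 0.3, three consecutive frames are zeroed and the input is left unmodified.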

Performance and Benefits

Experimental results show that Diff-HierVC performs strongly in both pitch generation and voice style transfer. In zero-shot voice conversion scenarios, it achieves a Character Error Rate (CER) of 0.83% and an Equal Error Rate (EER) of 3.29%. Lower is better for both metrics, and these figures indicate that the model adapts to speakers unseen during training without extensive prior data, a critical feature for real-world applications.
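For readers unfamiliar with the two metrics, the sketch below shows how each is defined. CER is the character-level edit distance between a reference and a hypothesis transcript, divided by the reference length. EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch assumes, as is common in voice conversion evaluation, that the transcripts come from a speech recognizer run on converted speech and the scores from a speaker verification system; the function names and toy inputs are illustrative and are not the authors' evaluation code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def eer(genuine: List[float], impostor: List[float]) -> float:
    """Approximate equal error rate from verification scores.

    Sweeps every observed score as a threshold and returns the mean of
    the false acceptance and false rejection rates where they are closest.
    """
    best_gap, best_rate = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate


# Toy inputs: two substituted characters out of 16 reference characters.
example_cer = cer("voice conversion", "voice conversoin")  # 0.125
# Toy scores: one impostor scores above one genuine trial.
example_eer = eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])  # 0.25
```

Production evaluations typically interpolate the error curves instead of picking the nearest observed threshold, so this `eer` is an approximation on small score sets.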

Practical Application

To facilitate easy use, the authors provide pre-trained model checkpoints and scripts to run the system with minimal setup. Users can test the model on their own datasets and explore the results with their own audio inputs. This flexibility and accessibility make Diff-HierVC not only a research tool but also a practical solution for voice conversion needs.

Conclusion

Diff-HierVC represents a major advancement in voice conversion technology. By addressing the persistent issues of pitch accuracy and speaker adaptation, it paves the way for more effective and reliable voice conversion systems. Its hierarchical approach, combined with robust pitch generation and adaptive capabilities, establishes Diff-HierVC as a leading solution in the field of voice conversion, promising new possibilities for audio processing and communication technologies.

This project, accepted at Interspeech 2023, reflects the dedication and innovation of its creators, Ha-Yeong Choi, Sang-Hoon Lee, and Seong-Whan Lee, bringing us closer to seamless and versatile voice conversion solutions.