Introducing StreamSpeech
StreamSpeech is a project for real-time speech-to-speech translation, introduced in the ACL 2024 paper "StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning". It presents a multi-task approach that handles translation and speech synthesis within a single model.
Highlights of StreamSpeech
StreamSpeech achieves state-of-the-art performance on both offline and simultaneous speech-to-speech translation. The model stands out for its "All in One" seamless design: a single model performs streaming ASR (Automatic Speech Recognition), simultaneous speech-to-text translation, and simultaneous speech-to-speech translation. It can also surface intermediate results, such as the ASR transcript or the text translation, while the speech translation is still in progress. Presenting these partial outputs during translation gives users immediate feedback and supports low-latency communication.
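To make the idea of intermediate results concrete, here is a minimal, hypothetical sketch of what consuming such a model could look like from the caller's side. The StreamingTranslator class, its step method, and the field names are illustrative stand-ins defined inside the snippet itself, not StreamSpeech's actual API.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class StreamingOutput:
    """One incremental result emitted while source speech is still arriving."""
    asr_partial: str          # streaming ASR hypothesis so far (source language)
    translation_partial: str  # simultaneous S2TT hypothesis so far (target language)
    speech_units: List[int]   # discrete units handed to a vocoder (simultaneous S2ST)


class StreamingTranslator:
    """Mock stand-in for a simultaneous S2ST model (illustrative only)."""

    def __init__(self) -> None:
        self._asr = ""
        self._mt = ""

    def step(self, audio_chunk: bytes) -> StreamingOutput:
        # A real model would update its encoder states with the new chunk and
        # decide, via its policy, whether to emit more text/units or wait.
        self._asr += " <asr-token>"
        self._mt += " <target-token>"
        return StreamingOutput(self._asr.strip(), self._mt.strip(), [0, 1, 2])


def translate_stream(chunks: Iterator[bytes]) -> None:
    model = StreamingTranslator()
    for chunk in chunks:
        out = model.step(chunk)
        # Intermediate ASR/translation can be shown to the user immediately,
        # while speech units would be sent on to a vocoder for audio playback.
        print(f"ASR: {out.asr_partial!r} | MT: {out.translation_partial!r}")


if __name__ == "__main__":
    translate_stream(iter([b"\x00" * 3200] * 4))  # four fake audio chunks
```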
Recent Updates
StreamSpeech is under active development and received updates on June 5th and June 17th. A Web GUI demo now lets anyone try the model directly in a browser, and the project's code, models, and paper are openly available for deeper engagement and experimentation.
Features of StreamSpeech
Supported Tasks
StreamSpeech supports a diverse range of tasks, grouped into two categories, offline and simultaneous:
- Offline tasks include Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech-to-Speech Translation (S2ST), and Text-to-Speech Synthesis (TTS).
- Simultaneous tasks cover real-time use: streaming ASR, simultaneous S2TT and S2ST, and real-time TTS, all handled by a single model with adjustable latency settings (a latency-policy sketch follows this list).
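The adjustable-latency idea can be illustrated with a generic wait-k schedule, where a single parameter k controls how much source audio is read before each output step. This is a textbook policy used purely for illustration; StreamSpeech uses its own read/write policy, which this sketch does not attempt to reproduce.

```python
from typing import Iterator, List


def wait_k_schedule(source: List[str], k: int) -> Iterator[tuple]:
    """Generic wait-k policy: read k source chunks, then alternate read/write.

    Smaller k -> lower latency but less source context per output step;
    larger k -> higher latency but more context. Illustrative only, not
    StreamSpeech's learned policy.
    """
    read, written = 0, 0
    while written < len(source):  # simplification: one output step per source chunk
        while read < min(written + k, len(source)):
            read += 1
            yield ("READ", source[read - 1])
        written += 1
        yield ("WRITE", f"<target step {written} using {read} source chunks>")


if __name__ == "__main__":
    chunks = [f"chunk{i}" for i in range(5)]
    for k in (1, 3):
        print(f"--- latency parameter k={k} ---")
        for action, payload in wait_k_schedule(chunks, k):
            print(action, payload)
```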
GUI Demo
The GUI demo visualizes how a single seamless model performs ASR, translation, and speech synthesis at the same time, and it makes StreamSpeech easy to experience firsthand in a browser environment.
Case Study
A case study shows the model transcribing French audio and translating it into English, both simultaneously and offline. The simultaneous output keeps pace with the incoming speech, while the offline output trades that immediacy for higher precision.
Technical Requirements
Exploring StreamSpeech requires a compatible setup: Python 3.10 and PyTorch 2.0.1, together with the fairseq and SimulEval libraries.
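As a quick sanity check before installing, a small script along these lines can confirm that the local environment matches the versions listed above. The version strings come from this section and may need updating as the project evolves.

```python
"""Environment check for the versions named above (illustrative)."""
import importlib.util
import sys


def check_environment() -> None:
    # Python 3.10 is the version named in the requirements above.
    if sys.version_info[:2] != (3, 10):
        print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
              "detected; StreamSpeech is described against Python 3.10.")

    try:
        import torch
        if not torch.__version__.startswith("2.0.1"):
            print(f"Warning: torch {torch.__version__} found, 2.0.1 expected.")
    except ImportError:
        print("PyTorch is not installed.")

    # fairseq and SimulEval are also required; install them as described
    # in the project's own instructions.
    for pkg in ("fairseq", "simuleval"):
        if importlib.util.find_spec(pkg) is None:
            print(f"Package {pkg!r} is not installed.")


if __name__ == "__main__":
    check_environment()
```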
Getting Started Quickly
To get started, download the pretrained models, which are available from several hosting platforms and cover translation directions among French, Spanish, German, and English. A unit-based HiFi-GAN vocoder, which turns the model's predicted speech units into an audible waveform, should also be downloaded for speech synthesis.
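A minimal sketch of fetching and inspecting a downloaded checkpoint is shown below. The URLs and file names are placeholders, not the project's actual release links; consult the StreamSpeech repository for the real download locations and for how the checkpoints are used with fairseq and SimulEval.

```python
"""Illustrative download-and-load sketch; URLs and file names are placeholders."""
from pathlib import Path
from urllib.request import urlretrieve

import torch

# Hypothetical locations -- replace with the links published by the project.
MODEL_URL = "https://example.com/streamspeech/streamspeech.fr-en.pt"
VOCODER_URL = "https://example.com/vocoder/unit_hifigan.pt"
DOWNLOAD_DIR = Path("checkpoints")


def fetch(url: str) -> Path:
    """Download a checkpoint once and cache it locally."""
    DOWNLOAD_DIR.mkdir(exist_ok=True)
    target = DOWNLOAD_DIR / url.rsplit("/", 1)[-1]
    if not target.exists():
        urlretrieve(url, target)
    return target


if __name__ == "__main__":
    for url in (MODEL_URL, VOCODER_URL):
        path = fetch(url)
        # Loading on CPU is enough to confirm the checkpoint is intact;
        # actual inference is run through fairseq/SimulEval as documented upstream.
        state = torch.load(path, map_location="cpu")
        print(path.name, "->", type(state).__name__)
```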
Conclusion
StreamSpeech is a cutting-edge solution for real-time speech-to-speech translation that pairs state-of-the-art performance with ease of use. With its simultaneous task capabilities and continuous updates, it is a project worth watching for anyone interested in advances in speech technology.