quillman - Real-time Voice Interaction with Innovative Speech-to-Speech Model

QuiLLMan: Voice Chat with Moshi

QuiLLMan is a complete voice chat application built around a speech-to-speech language model. Audio streams continuously in both directions between the user and the model, which keeps the conversation responsive.

How QuiLLMan Works

The backend of QuiLLMan runs Kyutai Labs' Moshi model, which continuously listens to the user, plans a response, and speaks it. Moshi relies on the Mimi streaming encoder/decoder to keep an unbroken stream of audio flowing in and out, and on a speech-text foundation model to decide when and how to respond.
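To illustrate what an unbroken audio stream implies on the input side, here is a minimal sketch of frame buffering: microphone data tends to arrive in arbitrarily sized chunks, while a streaming encoder typically consumes fixed-size frames. The FrameBuffer class and the FRAME_SIZE value are hypothetical and chosen for illustration; they are not taken from QuiLLMan's or Mimi's code.

```python
from typing import List

# Assumed frame length for illustration only: 1920 samples is 80 ms at 24 kHz.
FRAME_SIZE = 1920


class FrameBuffer:
    """Collects incoming samples and releases them as fixed-size frames."""

    def __init__(self, frame_size: int = FRAME_SIZE) -> None:
        self.frame_size = frame_size
        self._pending: List[float] = []

    @property
    def pending(self) -> int:
        """Number of samples waiting for the next complete frame."""
        return len(self._pending)

    def push(self, samples: List[float]) -> List[List[float]]:
        """Add samples and return every complete frame now available."""
        self._pending.extend(samples)
        frames = []
        while len(self._pending) >= self.frame_size:
            frames.append(self._pending[: self.frame_size])
            self._pending = self._pending[self.frame_size :]
        return frames
```

Leftover samples stay in the buffer until the next chunk arrives, so no audio is dropped or duplicated between frames.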

QuiLLMan uses bidirectional websocket streaming and the Opus audio codec, which compresses audio sent over the network. Together these keep latency low: response times can be nearly instantaneous, close to the natural rhythm of human conversation.
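The sketch below shows the full-duplex pattern in isolation, assuming nothing about QuiLLMan's actual code: a send loop and a receive loop run concurrently, so audio can flow in both directions at once. In-memory asyncio queues stand in for the websocket, byte strings stand in for Opus packets, and fake_server is a hypothetical placeholder for the model.

```python
import asyncio
from typing import List, Optional


async def fake_server(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Placeholder for the model: answers each chunk as soon as it arrives."""
    while True:
        chunk: Optional[bytes] = await inbound.get()
        if chunk is None:  # None marks the end of the stream
            await outbound.put(None)
            return
        await outbound.put(chunk.upper())


async def send_loop(chunks: List[bytes], inbound: asyncio.Queue) -> None:
    """Push outgoing chunks without waiting for any reply."""
    for chunk in chunks:
        await inbound.put(chunk)
        await asyncio.sleep(0)  # yield so the other loops can make progress
    await inbound.put(None)


async def recv_loop(outbound: asyncio.Queue, received: List[bytes]) -> None:
    """Collect replies as they arrive, independently of the send loop."""
    while True:
        chunk = await outbound.get()
        if chunk is None:
            return
        received.append(chunk)


async def run_session(chunks: List[bytes]) -> List[bytes]:
    """Run sender, receiver, and placeholder server concurrently."""
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()
    received: List[bytes] = []
    await asyncio.gather(
        fake_server(inbound, outbound),
        send_loop(chunks, inbound),
        recv_loop(outbound, received),
    )
    return received
```

Because neither loop blocks the other, the receiver can start playing replies while the sender is still transmitting, which is the property that makes low-latency conversation possible.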

For those interested in experiencing QuiLLMan firsthand, a live demo is available here.


A Launchpad for Innovation

QuiLLMan is designed not only as a standalone application but also as a foundation for developing your own language model-based projects. It serves as a rich environment for technological experimentation and invites contributions from developers eager to explore new frontiers in voice chat applications.

Setting Up QuiLLMan Locally

For those looking to explore QuiLLMan's capabilities, setting up a local development environment is straightforward:

Requirements

Install modal in your current Python environment (pip install modal).
Sign up for a Modal account and configure it (modal setup).
Generate and set up a Modal token (modal token new).
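Before starting a development server, the requirements above can be checked with a small preflight script. This is a sketch under stated assumptions: check_modal_setup is a hypothetical helper, and it assumes the Modal client reads credentials either from the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET environment variables or from a .modal.toml file in the home directory.

```python
import importlib.util
import os
from pathlib import Path


def check_modal_setup() -> dict:
    """Report whether the modal package and a token appear to be available."""
    # True if `pip install modal` has been run in this environment.
    package_installed = importlib.util.find_spec("modal") is not None

    # Assumption: credentials come from these environment variables...
    env_token = bool(
        os.environ.get("MODAL_TOKEN_ID") and os.environ.get("MODAL_TOKEN_SECRET")
    )
    # ...or from the config file assumed to be written by `modal token new`.
    config_file = (Path.home() / ".modal.toml").is_file()

    return {
        "package_installed": package_installed,
        "token_found": env_token or config_file,
    }


if __name__ == "__main__":
    for name, ok in check_modal_setup().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A "missing" result only means the script did not find what it looked for in the assumed locations; the modal CLI remains the authoritative check.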

Inference Module Development

QuiLLMan's Moshi server is a Modal class module that loads the models and maintains streaming state. It includes a FastAPI HTTP server that exposes a websocket interface over the internet. To start it in development mode, use the command:

modal serve src.moshi

Monitor the terminal output for a websocket connection URL and note that any project file updates apply automatically. To stop the app, press Ctrl+C.

Testing the Websocket Connection

From another terminal, you can directly test the websocket connection with tests/moshi_client.py. First, install the necessary dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements/requirements-dev.txt

Then run the terminal client and start talking:

python tests/moshi_client.py

Ensure your microphone and speakers are functional.

Frontend and HTTP Server Development

The application's HTTP server is defined in src/app.py and uses FastAPI to serve the static frontend files. To start a development server, run:

modal serve src.app

Since src/app.py incorporates src/moshi.py, this command will also launch the Moshi websocket server. Any changes to project files reflect automatically, but clearing the browser cache may be required for frontend modifications.

Deploying on Modal

Once development is complete, you can deploy the app to Modal:

modal deploy src.app

This deployment includes both the frontend server and the Moshi websocket server. Leaving the app deployed costs nothing while it is idle: Modal apps are serverless and scale to zero when not in use, so you incur usage only while the app is handling requests.

QuiLLMan not only showcases cutting-edge technology in voice communication but also empowers developers to push the boundaries of what's possible within voice chat applications.