Introduction to Insanely Fast Whisper API
The Insanely Fast Whisper API provides a rapid solution for audio transcription using OpenAI's Whisper Large v3 model. This powerful API is built with advanced technologies such as 🤗 Transformers, Optimum, and flash-attn to deliver exceptional performance and scalability.
Key Features
- Lightning-Fast Transcription: Transform audio into text at remarkable speeds, optimizing both efficiency and productivity.
- Open Source & Deployable: Completely open-source, allowing deployment on any GPU-equipped cloud platform.
- Speaker Diarization: Distinguishes between different speakers in an audio clip, enriching the transcript with speaker labels.
- User-Friendly API: Designed with an easy-to-use and fast API layer, simplifying integration.
- Background Tasks & Webhooks: Supports asynchronous processing and utilizes webhooks to deliver results upon task completion.
- Optimized for Performance: Engineered for high concurrency and parallel processing, significantly reducing processing time.
- Task Management: Provides endpoints for task management, including starting, checking status, and canceling operations.
- Secure Access: Admin authentication ensures secure API access, protecting your projects and data.
- Scalable Managed Service: Available as a fully managed API through JigsawStack, providing superior scalability and reduced costs.
Technical Backbone
This project continues the work of the Insanely Fast Whisper CLI, extending its capabilities to GPU-equipped cloud infrastructure for production environments. It is fully dockerized for easy deployment across various platforms, such as Fly.io, whose recently launched GPU offering makes it a natural fit for hosting this API.
Benchmark Performance
The API shows impressive results in speed benchmarks, processing 150 minutes of audio in under two minutes even with additional configurations like speaker diarization enabled. These speeds come from optimizations such as batching, Flash Attention, and fp16 precision.
Deployment Guide
The Insanely Fast Whisper API can be easily deployed using Docker. The provided Docker image can be pulled from Docker Hub:
yoeven/insanely-fast-whisper-api:latest
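As a minimal sketch, the image could be run locally with Docker as shown below. The image name and the ADMIN_KEY / HF_TOKEN variable names come from this guide; the exposed port (uvicorn's default 8000) and the env-var wiring are assumptions, not confirmed details of the image:
docker pull yoeven/insanely-fast-whisper-api:latest
# Assumes a GPU host with the NVIDIA container toolkit installed;
# port 8000 and the env-var names are assumptions.
docker run --gpus all -p 8000:8000 \
  -e ADMIN_KEY=<your_token> \
  -e HF_TOKEN=<your_hf_key> \
  yoeven/insanely-fast-whisper-api:latest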
For deployment on Fly.io, access to Fly GPUs is essential. After cloning the project and configuring the necessary settings, you can launch a new Fly app using the Fly CLI:
fly launch
Additionally, you can integrate speaker diarization and secure the API by setting environment secrets with:
fly secrets set ADMIN_KEY=<your_token> HF_TOKEN=<your_hf_key>
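Putting the Fly.io steps together, a rough end-to-end sketch might look like the following. The repository URL is left as a placeholder, and any GPU-specific fly.toml tuning is outside the scope of this sketch:
# Clone the project (replace <repo_url> with the project's repository)
git clone <repo_url> && cd insanely-fast-whisper-api
# Create the Fly app interactively
fly launch
# Enable diarization and secure the API
fly secrets set ADMIN_KEY=<your_token> HF_TOKEN=<your_hf_key>
# Deploy (or redeploy after configuration changes)
fly deploy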
This API can also be deployed on other cloud platforms supporting Docker, providing you with the flexibility to choose based on your infrastructure.
Using the API
To authenticate, set an ADMIN_KEY during setup and pass it as x-admin-api-key in the request headers. The API supports the following endpoints for transcription tasks:
- POST /: Send an audio URL for transcription, with additional parameters for processing options such as task type (transcribe or translate), language, batch size, and optional webhooks for asynchronous communication (see the example request after this list).
- GET /tasks: Fetch all active transcription tasks.
- GET /status/{task_id}: Check the status of a specific task.
- DELETE /cancel/{task_id}: Cancel an asynchronous transcription job if needed.
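As an illustration, a transcription request might look like the sketch below. The endpoints and the x-admin-api-key header come from the description above; the exact JSON field names for the audio URL, task type, language, batch size, and webhook are assumptions and may differ from the actual request schema:
# Start a transcription task (field names are illustrative)
curl -X POST http://<your-deployment>/ \
  -H "x-admin-api-key: <your_token>" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com/audio.mp3",
        "task": "transcribe",
        "language": "en",
        "batch_size": 24,
        "webhook": { "url": "https://example.com/webhook" }
      }'

# Check the status of a task, or cancel it
curl -H "x-admin-api-key: <your_token>" http://<your-deployment>/status/<task_id>
curl -X DELETE -H "x-admin-api-key: <your_token>" http://<your-deployment>/cancel/<task_id>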
Running Locally
For a local setup, clone the repository and follow the included instructions to set up a Python environment and dependencies, before running the application using Uvicorn:
uvicorn app.app:app --reload
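A hedged sketch of a full local run follows, assuming a standard pip-based workflow (the repository's actual dependency setup may use a different tool) and that the same ADMIN_KEY and HF_TOKEN environment variables used on Fly.io are honored locally:
# Create an isolated environment and install dependencies (workflow assumed)
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Provide credentials via environment variables (names taken from the Fly.io secrets above)
export ADMIN_KEY=<your_token>
export HF_TOKEN=<your_hf_key>
# Start the API on uvicorn's default port 8000
uvicorn app.app:app --reload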
Power Management
To control costs, especially on services like Fly.io, the API can be programmatically shut down after use, minimizing idling time and related charges.
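The project's exact shutdown hook is not detailed here, but as one illustration of the idea, a deployment script or webhook handler could stop the Fly machine once results are delivered, using the standard Fly CLI (machine and app names are placeholders):
# Stop a specific machine so it stops accruing GPU charges
fly machine stop <machine_id> --app <your_app_name>
# Or scale the app's machine count to zero entirely
fly scale count 0 --app <your_app_name>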
Conclusion
The Insanely Fast Whisper API combines cutting-edge AI tooling with practical deployment strategies, making it a strong choice for developers who need fast and accurate audio transcription. With its robust feature set and commitment to open-source development, it serves as a flexible and efficient solution for speech-to-text workloads.