Introduction to the STT Project
The STT project is an offline speech-to-text tool that converts human speech in audio or video files into written text. Built on the open-source faster-whisper model, it can produce output as JSON, as SRT subtitles with timestamps, or as plain text. Unlike popular online services, it runs entirely locally while delivering accuracy comparable to OpenAI's API or Baidu's speech recognition services.
Available Models
The faster-whisper model comes in several sizes: tiny, base, small, medium, and large-v3. Each step up (from tiny to large-v3) improves recognition accuracy but also demands more computing resources. The tiny model is included by default; other models can be downloaded and extracted into the "models" directory as needed.
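A small helper can pick the largest model actually present on disk, falling back to the bundled tiny model. This is only a sketch: the layout of one subdirectory per model name under "models" is an assumption for illustration, not a documented contract of the project.

```python
from pathlib import Path

# Model names ordered from smallest to largest, as described above.
MODEL_SIZES = ["tiny", "base", "small", "medium", "large-v3"]

def best_available_model(models_dir: str) -> str:
    """Return the largest model found under models_dir, else the default tiny model."""
    root = Path(models_dir)
    for name in reversed(MODEL_SIZES):
        if (root / name).is_dir():
            return name
    return "tiny"  # shipped by default
```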
Features Highlight
- Offline Capability: No internet access required; runs entirely on the local machine.
- Multi-Format Output: Offers choices in output format, catering to different requirements for processing and utilization.
- Self-Deployment: Users can set up the tool as an alternative to major voice-recognition APIs.
- Accuracy: Maintains a high level of accuracy, in line with leading API options.
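To illustrate the multi-format output, the sketch below converts a list of timed segments into SRT subtitle blocks. The segment shape used here ("start"/"end" in seconds plus "text") is an assumption for illustration, not the tool's documented JSON schema.

```python
def seconds_to_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> "00:00:03,500"."""
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render [{"start": ..., "end": ..., "text": ...}] segments as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{seconds_to_srt_time(seg['start'])} --> "
            f"{seconds_to_srt_time(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```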
Getting Started
For those interested in using the precompiled version for Windows or deploying from source on Linux or Mac, here’s how to get started.
Windows Precompiled Version
- Download: Visit the Releases page and download the desired file.
- Extract: Unzip the contents to a location on your computer, such as E:/stt.
- Run: Double-click start.exe; a browser window will open automatically.
- Upload and Configure: Add audio or video files by dragging them into the upload area or by using the upload feature. Choose the spoken language, output format, and desired model.
- Recognition: Click "Start Recognition" and wait as the results display in the chosen format.
Source Code Deployment (Linux, Mac, Windows)
- Requirements: Ensure Python 3.9 to 3.11 is installed.
- Directory Setup: Create a new directory, such as E:/stt, and pull the source using git.
- Virtual Environment: Set up a virtual environment and activate it.
- Dependencies: Install required dependencies. Add CUDA support if necessary for faster processing.
- FFmpeg: On Windows, extract ffmpeg.exe and ffprobe.exe into your project folder. For Linux and Mac, install FFmpeg through native package managers.
- Model Download: Download and place model files into the "models" directory.
- Launch: Run python start.py and follow the prompts to access the local web interface.
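On Linux or Mac, the steps above might look like the following. The repository URL is a placeholder here, and the package-manager commands are examples; adjust them to your setup.

```shell
# Requirements: Python 3.9-3.11 and git installed.
mkdir stt && cd stt
git clone <repository-url> .        # placeholder: substitute the project's repo URL
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Install FFmpeg via the native package manager, e.g.:
#   sudo apt install ffmpeg    (Debian/Ubuntu)
#   brew install ffmpeg        (macOS)
python start.py
```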
API Interface
The tool offers a RESTful API for integrating speech-to-text functionality into other projects. The endpoint http://127.0.0.1:9977/api accepts POST requests with parameters for language, model, response format, and the audio/video file.
Below is a simple example in Python:
import requests

# Local API endpoint started by the tool
url = "http://127.0.0.1:9977/api"

# Spoken language, model, and desired output format
data = {"language": "zh", "model": "base", "response_format": "json"}

# Open the audio/video file and send it as multipart form data
with open("C:/Users/c1/Videos/2.wav", "rb") as f:
    files = {"file": f}
    response = requests.post(url, data=data, files=files)

print(response.json())
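When consuming the JSON response, it helps to check for errors before reading the transcript. The schema assumed below (a numeric "code" of 0 on success, a "msg" error string, and the transcript in "data") is an illustration, not the documented response contract; inspect an actual response to confirm the field names.

```python
def extract_text(payload: dict) -> str:
    """Return the transcript from an assumed {"code", "msg", "data"} response."""
    if payload.get("code", -1) != 0:
        raise RuntimeError(f"recognition failed: {payload.get('msg', 'unknown error')}")
    return payload["data"]
```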
CUDA Acceleration Support
For systems equipped with Nvidia GPUs, CUDA acceleration can significantly speed up the recognition process. Install the CUDA Toolkit and the cuDNN library corresponding to your CUDA version. After installation, run nvcc --version and nvidia-smi to verify correct setup.
Should any issues arise, users are encouraged to verify installation steps and settings to ensure correct configuration.
In summary, the STT project provides a robust and flexible local solution for speech-to-text tasks, accommodating a wide spectrum of formats, models, and deployment environments.