Introduction to the STT Project
The STT project is an offline speech-to-text tool that converts human speech in audio or video files into written text. Built on the open-source faster-whisper model, it can produce output as JSON, as SRT subtitles with timestamps, or as plain text. Unlike popular online services, it runs entirely locally while delivering accuracy comparable to OpenAI's API or Baidu's speech recognition services.
Available Models
The faster-whisper model comes in several sizes: tiny, base, small, medium, and large-v3. Each step up (from tiny to large-v3) improves recognition accuracy but also demands more computing resources. The tiny model is included by default; other models can be downloaded and extracted into the "models" directory as needed.
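A small helper can pick the largest model actually present on disk, falling back to the bundled tiny model. This is only a sketch: the layout of one subdirectory per model name under "models" is an assumption for illustration, not a documented contract of the project.

```python
from pathlib import Path

# Model names ordered from smallest to largest, as described above.
MODEL_SIZES = ["tiny", "base", "small", "medium", "large-v3"]

def best_available_model(models_dir: str) -> str:
    """Return the largest model found under models_dir, else the default tiny model."""
    root = Path(models_dir)
    for name in reversed(MODEL_SIZES):
        if (root / name).is_dir():
            return name
    return "tiny"  # shipped by default
```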
Features Highlight
- Offline Capability: No internet access required; runs entirely on the local machine.
- Multi-Format Output: Offers choices in output format, catering to different requirements for processing and utilization.
- Self-Deployment: Users can set up the tool as an alternative to major voice-recognition APIs.
- Accuracy: Maintains a high level of accuracy, in line with leading API options.
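To illustrate the multi-format output, the sketch below converts a list of timed segments into SRT subtitle blocks. The segment shape used here ("start"/"end" in seconds plus "text") is an assumption for illustration, not the tool's documented JSON schema.

```python
def seconds_to_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> "00:00:03,500"."""
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render [{"start": ..., "end": ..., "text": ...}] segments as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{seconds_to_srt_time(seg['start'])} --> "
            f"{seconds_to_srt_time(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```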
Getting Started
For those interested in using the precompiled version for Windows or deploying from source on Linux or Mac, here’s how to get started.
Windows Precompiled Version
- Download: Visit the Releases page and download the desired file.
- Extract: Unzip the contents to a location on your computer, such as E:/stt.
- Run: Double-click start.exe; a browser window will open automatically.
- Upload and Configure: Add audio or video files by dragging them into the upload area or by using the upload feature. Choose the spoken language, output format, and desired model.
- Recognition: Click "Start Recognition" and wait as the results display in the chosen format.
Source Code Deployment (Linux, Mac, Windows)
- Requirements: Ensure Python 3.9 to 3.11 is installed.
- Directory Setup: Create a new directory, such as E:/stt, and pull the source using git.
- Virtual Environment: Set up a virtual environment and activate it.
- Dependencies: Install required dependencies. Add CUDA support if necessary for faster processing.
- FFmpeg: On Windows, extract ffmpeg.exe and ffprobe.exe into your project folder. For Linux and Mac, install FFmpeg through native package managers.
- Model Download: Download and place model files into the "models" directory.
- Launch: Run python start.py and follow the prompts to access the local web interface.
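On Linux or Mac, the steps above might look like the following. The repository URL is a placeholder here, and the package-manager commands are examples; adjust them to your setup.

```shell
# Requirements: Python 3.9-3.11 and git installed.
mkdir stt && cd stt
git clone <repository-url> .        # placeholder: substitute the project's repo URL
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Install FFmpeg via the native package manager, e.g.:
#   sudo apt install ffmpeg    (Debian/Ubuntu)
#   brew install ffmpeg        (macOS)
python start.py
```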
API Interface
The tool offers a RESTful API for integrating speech-to-text functionality into other projects. The endpoint http://127.0.0.1:9977/api accepts POST requests with parameters for language, model, response format, and the audio/video file.
Below is a simple example in Python:
import requests

# Local API endpoint started by the tool
url = "http://127.0.0.1:9977/api"

# Spoken language, model, and desired output format
data = {"language": "zh", "model": "base", "response_format": "json"}

# Open the audio/video file and send it as multipart form data
with open("C:/Users/c1/Videos/2.wav", "rb") as f:
    files = {"file": f}
    response = requests.post(url, data=data, files=files)

print(response.json())
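When consuming the JSON response, it helps to check for errors before reading the transcript. The schema assumed below (a numeric "code" of 0 on success, a "msg" error string, and the transcript in "data") is an illustration, not the documented response contract; inspect an actual response to confirm the field names.

```python
def extract_text(payload: dict) -> str:
    """Return the transcript from an assumed {"code", "msg", "data"} response."""
    if payload.get("code", -1) != 0:
        raise RuntimeError(f"recognition failed: {payload.get('msg', 'unknown error')}")
    return payload["data"]
```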
CUDA Acceleration Support
For systems equipped with Nvidia GPUs, CUDA acceleration can significantly speed up the recognition process. Install the CUDA Toolkit and the cuDNN library corresponding to your CUDA version. After installation, run nvcc --version and nvidia-smi to verify correct setup.
Should any issues arise, users are encouraged to verify installation steps and settings to ensure correct configuration.
In summary, the STT project provides a robust and flexible local solution for speech-to-text tasks, accommodating a wide spectrum of formats, models, and deployment environments.