speech-to-text-benchmark by Picovoice

STT benchmark framework for comparing speech-to-text engines

created 7 years ago
657 stars

Top 51.9% on sourcepulse

Project Summary

This repository provides a minimalist and extensible framework for benchmarking various speech-to-text (STT) engines. It is designed for researchers and developers who need to compare the performance of different STT solutions across multiple datasets and languages, offering metrics like Word Error Rate (WER), Core-Hour, and Model Size.

How It Works

The framework orchestrates the process of transcribing audio files using different STT engines and then evaluates the results against reference transcripts. It supports several popular STT services (Amazon Transcribe, Azure Speech-to-Text, Google Speech-to-Text, IBM Watson) and open-source models (OpenAI Whisper, Picovoice Cheetah/Leopard). The evaluation is based on standard metrics, providing a quantitative comparison of accuracy and computational efficiency.
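Word Error Rate is the word-level edit distance between an engine's hypothesis and the reference transcript, normalized by the number of reference words. A minimal sketch of that computation (a generic dynamic-programming implementation, not the repository's actual code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis that drops one word out of a three-word reference scores a WER of 1/3.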

Quick Start & Requirements

  • Install FFmpeg.
  • Install Python dependencies: pip3 install -r requirements.txt.
  • Tested on Ubuntu 22.04.
  • Requires API keys/credentials for cloud-based engines and potentially model files for local engines.
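Since the benchmark shells out to FFmpeg for audio handling, a quick pre-flight check can save a failed run. A small sketch (the helper name is illustrative, not part of the repository):

```python
import shutil
import sys


def tool_available(name: str) -> bool:
    """Return True if an executable named `name` is on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if not tool_available("ffmpeg"):
        sys.exit("FFmpeg not found on PATH; install it before running the benchmark.")
    print("FFmpeg found; next, install Python deps: pip3 install -r requirements.txt")
```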

Highlighted Details

  • Benchmarks WER for English, French, German, Italian, Spanish, and Portuguese.
  • Includes "Core-Hour" and "Model Size" metrics for offline engines, with detailed performance data on a specific AMD CPU configuration.
  • Offers direct comparison tables for WER across multiple datasets and languages.
  • Supports various Whisper model sizes and Picovoice engines.
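The Core-Hour metric captures how many hours of CPU core time an offline engine needs to transcribe one hour of audio. A hedged sketch of that arithmetic (a plain reading of the metric's name, not taken from the repository's code):

```python
def core_hours_per_audio_hour(processing_seconds: float,
                              audio_seconds: float,
                              num_cores: int = 1) -> float:
    """Core-Hour: CPU core time spent per hour of audio transcribed.

    Core-seconds consumed divided by audio-seconds processed equals
    core-hours per audio-hour, so no unit conversion is needed.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (processing_seconds * num_cores) / audio_seconds
```

For example, an engine that takes 360 seconds on a single core to transcribe one hour (3600 s) of audio uses 0.1 Core-Hours per hour of audio; lower is cheaper to run.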

Maintenance & Community

  • Developed by Picovoice.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository itself appears to be under a permissive license, but the underlying STT engines have their own licensing terms and usage costs.
  • Commercial use depends on the licensing of the individual STT services and models being benchmarked.

Limitations & Caveats

  • Benchmarking results are specific to the hardware and configuration used for testing.
  • Cloud-based engines are excluded from Core-Hour and Model Size metrics.
  • Some engines (e.g., Whisper Large) were omitted from specific benchmark tests.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

Explore Similar Projects

Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 19 more.

  • whisper by openai — Speech recognition model for multilingual transcription/translation (0.4%, 86k stars; created 2 years ago, updated 1 month ago)