This repository curates Korean speech-to-text (STT) APIs and benchmarks them by Character Error Rate (CER) on public datasets. It targets developers and researchers evaluating STT solutions for Korean-language applications, offering objective comparisons to aid selection and development.
How It Works
The project evaluates several Korean STT APIs, including OpenAI Whisper, Google Cloud Speech-to-Text v2, ETRI, Naver Clova Speech, and Return Zero's VITO Speech. Performance is measured as Character Error Rate (CER) against several AI-Hub datasets covering conversational speech, call center recordings, and educational content. CER is used because Korean is agglutinative and its word boundaries are ambiguous, which makes character-level accuracy a more robust metric than Word Error Rate (WER).
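To make the metric concrete, here is a minimal sketch of a CER computation: the character-level Levenshtein distance between reference and hypothesis, divided by the reference length. The function names and the `ignore_spaces` option are assumptions for illustration; the repository's own scoring script may treat whitespace and normalization differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_spaces: bool = True) -> float:
    """Character Error Rate: edit distance divided by reference length.

    ignore_spaces is an assumption of this sketch: dropping whitespace keeps
    spacing differences from counting as errors.
    """
    if ignore_spaces:
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted syllable out of five characters.
    print(cer("안녕하세요", "안녕하세여"))
    # A pure spacing difference is not penalized when spaces are ignored.
    print(cer("밥을 먹었다", "밥을먹었다"))
```

The second call illustrates the point about word boundaries: a spacing difference alone would count as a word error under WER, while CER with spaces ignored scores the hypothesis as correct.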
Quick Start & Requirements
- Usage: The project primarily presents benchmark results. Direct API usage requires individual API key setup and adherence to each provider's documentation.
- Data: Benchmarks are based on AI-Hub datasets. A sample of 3000 sentences per dataset was used for testing.
- Resources: Evaluating the APIs requires API keys and may incur per-call costs.
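The 3000-sentence sampling described above could be reproduced along these lines. This is a sketch under assumptions: the function name, the fixed seed, and the choice of uniform random sampling are illustrative, and the repository may have selected its test sentences differently.

```python
import random


def sample_sentences(utterances, k=3000, seed=0):
    """Draw up to k items uniformly at random, reproducibly.

    utterances is any iterable of items (for example, audio/transcript pairs).
    The seed value is an arbitrary choice for this sketch.
    """
    items = list(utterances)
    if len(items) <= k:
        # Fewer items than requested: use the whole dataset.
        return items
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(items, k)


if __name__ == "__main__":
    dataset = [f"utt_{i:05d}" for i in range(10_000)]  # placeholder IDs
    subset = sample_sentences(dataset)
    print(len(subset))
```

Fixing the seed means every API is scored on the same subset, which keeps the comparison fair across providers.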
Highlighted Details
- Performance Benchmarks: Detailed CER results are provided for multiple APIs across diverse Korean speech datasets.
- CER vs. WER: Explains the rationale for using CER for Korean, highlighting linguistic challenges with WER.
- Data Normalization: Discusses the impact of varying data normalization practices across datasets on STT model performance.
- API List: Includes APIs readily available for developers without extensive approval processes.
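As an illustration of why normalization matters, the sketch below applies one possible set of rules to both reference and hypothesis before scoring. The rules chosen here (Unicode NFC composition, punctuation removal, lowercasing, whitespace collapsing) are assumptions for illustration, not the repository's actual pipeline, and they do not address harder cases such as digits versus spelled-out numbers.

```python
import re
import unicodedata

# Anything that is neither a word character nor whitespace is treated as punctuation.
_PUNCT = re.compile(r"[^\w\s]")


def normalize(text: str) -> str:
    """Apply a simple, illustrative normalization before computing CER."""
    text = unicodedata.normalize("NFC", text)  # compose decomposed Hangul jamo
    text = _PUNCT.sub("", text)                # drop punctuation marks
    text = text.lower()                        # fold case for embedded Latin text
    return " ".join(text.split())              # collapse runs of whitespace


if __name__ == "__main__":
    print(normalize("안녕하세요,  여러분!"))
```

If one dataset's references keep punctuation and an API's output does not, every punctuation mark is counted as a character error unless both sides pass through the same normalization.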
Maintenance & Community
- Contributions: Contributions are welcomed via Issues and Pull Requests. Contact email: research@rtzr.ai.
- Development: Return Zero has contributed to making KsponSpeech available in SpeechBrain and Hugging Face.
Licensing & Compatibility
- License: The repository does not appear to specify a license of its own. It compiles and presents results from various services, and the underlying datasets and APIs carry their own licensing terms.
- Commercial Use: Commercial use depends on the terms of each individual STT API provider listed.
Limitations & Caveats
- Benchmarks are based on a sampled subset (3000 sentences) of each dataset, so results may not reflect performance on the full datasets.
- Google Cloud Speech-to-Text v2 had file size and duration limits that affected some tests.
- The list of APIs is not exhaustive; some services, such as Amazon Transcribe and Microsoft Speech Service, are noted for future testing.