GLM-ASR by zai-org

Robust speech recognition model for challenging audio

Created 1 month ago
670 stars

Top 50.5% on SourcePulse

Summary

GLM-ASR-Nano is an open-source automatic speech recognition (ASR) model designed to handle real-world audio complexities. It targets users needing robust transcription, especially for diverse dialects and low-volume speech, offering a competitive alternative to existing models like Whisper V3.

How It Works

This 1.5-billion-parameter model uses an architecture optimized for challenging acoustic environments. Its design prioritizes dialect support, including Cantonese, along with specialized training for low-volume or quiet speech. This focus aims at state-of-the-art performance, particularly on Chinese benchmarks, by capturing nuances that conventional ASR systems often miss.

Quick Start & Requirements

Installation involves pip install -r requirements.txt and sudo apt install ffmpeg. Inference can be run with the transformers library (version 5.x is supported), vLLM, or SGLang; example inference scripts for English and Chinese audio are provided.
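
A minimal sketch of transformers-based inference, assuming the model can be loaded through the standard automatic-speech-recognition pipeline; the Hugging Face model id shown is an assumption, and the repo's own example scripts are the authoritative reference.

    # Minimal sketch of transformers-based inference. The model id
    # "zai-org/GLM-ASR-Nano" is an assumption; consult the repo's example
    # scripts for the exact loading code and any model-specific options.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="zai-org/GLM-ASR-Nano",  # hypothetical Hugging Face model id
    )

    # ffmpeg (installed above) lets the pipeline decode common audio formats.
    print(asr("english_sample.wav")["text"])
    print(asr("chinese_sample.wav")["text"])

The same model can reportedly be served through vLLM or SGLang for higher-throughput deployments; see the repo for those scripts.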

Highlighted Details

  • Achieves a state-of-the-art average error rate of 4.10 on Chinese benchmarks (a worked error-rate sketch follows this list).
  • Outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact 1.5B parameter size.
  • Demonstrates exceptional performance on Cantonese and other dialects, as well as low-volume speech.
  • Evaluated on real-world meeting scenarios (Wenet Meeting) and standard Mandarin datasets (Aishell-1).
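
To make the error-rate figures concrete, here is a small sketch of how a character error rate (CER) is typically computed for Chinese transcripts; the jiwer library and the example strings are illustrative and not part of the GLM-ASR repo.

    # Illustrative CER computation; jiwer and the sample strings below are
    # assumptions, not taken from the GLM-ASR benchmarks themselves.
    import jiwer

    reference = "今天天气很好"   # ground-truth transcript (6 characters)
    hypothesis = "今天天汽很好"  # model output with one substituted character

    cer = jiwer.cer(reference, hypothesis)
    print(f"CER: {cer:.2%}")   # 1 substitution / 6 characters ≈ 16.67%

The benchmark numbers above are averages of this kind of per-dataset error rate across the evaluated Chinese test sets.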

Maintenance & Community

The README mentions a WeChat community for engagement. It provides no specific details on core maintainers, sponsorships, or a roadmap.

Licensing & Compatibility

The README does not explicitly state the license type or any compatibility notes for commercial use.

Limitations & Caveats

The README does not detail limitations, alpha status, known bugs, or unsupported platforms. Its emphasis on Chinese dialects and low-volume speech may imply narrower coverage of other languages, or of standard-volume speech, compared with more general-purpose models.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 18
  • Star history: 247 stars in the last 30 days
