whisper.cpp by ggml-org

High-performance C/C++ port of OpenAI's Whisper ASR model

created 2 years ago
41,897 stars

Top 0.6% on sourcepulse

View on GitHub
Project Summary

This project provides a high-performance C/C++ implementation of OpenAI's Whisper automatic speech recognition (ASR) model, optimized for various hardware including Apple Silicon, x86, POWER, NVIDIA GPUs, and Intel/Ascend NPUs. It targets developers and researchers needing efficient, on-device speech-to-text capabilities across diverse platforms, offering significant speedups and reduced resource usage through techniques like quantization and mixed-precision inference.

How It Works

The core of the project is built upon the ggml machine learning library, enabling a lightweight, dependency-free C/C++ implementation of the Whisper model. This design facilitates easy integration into various applications and platforms. It leverages hardware-specific optimizations such as ARM NEON, Accelerate framework, Metal, and Core ML for Apple Silicon, and AVX/VSX intrinsics for x86/POWER architectures. The implementation supports mixed F16/F32 precision and integer quantization, minimizing memory allocations and improving inference speed.

Quick Start & Requirements

  • Install/Run: Clone the repository, download a ggml-formatted Whisper model (e.g., sh ./models/download-ggml-model.sh base.en), build the whisper-cli example (cmake -B build && cmake --build build --config Release), and transcribe an audio file (./build/bin/whisper-cli -f samples/jfk.wav).
  • Prerequisites: C++ compiler, CMake, FFmpeg (for non-WAV formats). Optional: CUDA for NVIDIA GPUs, OpenVINO for Intel hardware, Core ML for Apple Neural Engine, etc.
  • Setup Time: Building and downloading a base model is typically under 5 minutes.
  • Links: Official Docs, Models
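
The install/run steps above can be collected into one shell session. This is a sketch assuming a Unix-like environment with git, CMake, and a C++ toolchain on the PATH; the clone URL follows the project's GitHub location:

```shell
# Clone and enter the repository
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp

# Download a ggml-formatted Whisper model (base.en)
sh ./models/download-ggml-model.sh base.en

# Build the whisper-cli example
cmake -B build
cmake --build build --config Release

# Transcribe the bundled 16-bit WAV sample
./build/bin/whisper-cli -f samples/jfk.wav
```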

Highlighted Details

  • Supports a wide range of hardware accelerators: Apple Silicon (Metal, Core ML), NVIDIA (cuBLAS), Intel (OpenVINO), Ascend NPU, Moore Threads GPUs (MUSA), Vulkan.
  • Offers integer quantization (e.g., Q5_0) for reduced memory footprint and faster inference.
  • Provides experimental features like word-level timestamps, speaker segmentation (via tinydiarize), and karaoke-style video generation.
  • Includes bindings for Rust, JavaScript, Go, Java, Ruby, .NET, Python, R, and Unity.
  • Offers a precompiled XCFramework for easy integration into Swift projects.
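
As a rough back-of-envelope illustration of the Q5_0 memory savings: in ggml's block layout, Q5_0 packs 32 weights as 5-bit quants plus one 16-bit scale per block, i.e. about 5.5 bits per weight versus 16 for F16. The figures below are an approximation of storage for the quantized tensors only, not a measured model size:

```python
# Back-of-envelope estimate of Q5_0 vs. F16 weight storage.
# Assumption: ggml's Q5_0 block = 32 five-bit quants + one 16-bit scale.
bits_per_weight_q5_0 = (32 * 5 + 16) / 32  # 5.5 bits per weight
bits_per_weight_f16 = 16.0

ratio = bits_per_weight_f16 / bits_per_weight_q5_0
print(f"Q5_0: {bits_per_weight_q5_0} bits/weight, "
      f"~{ratio:.1f}x smaller than F16")
# → Q5_0: 5.5 bits/weight, ~2.9x smaller than F16
```

In practice the conversion is done with the repository's quantize example (built alongside whisper-cli), and the realized savings depend on which tensors are quantized.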

Maintenance & Community

The project is actively maintained by Georgi Gerganov and the ggml-org community. Discussions are encouraged for feedback and sharing projects.

Licensing & Compatibility

The project is released under the MIT License, allowing for commercial use and integration into closed-source applications.

Limitations & Caveats

The whisper-cli example currently requires 16-bit WAV files; other formats need conversion via FFmpeg. Real-time streaming requires SDL2. Some advanced features like speaker segmentation and karaoke generation are experimental.
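
Converting other formats to the expected 16 kHz mono 16-bit WAV is a one-liner with FFmpeg (input.mp3 and output.wav are placeholder names):

```shell
# Convert any audio FFmpeg can read to 16 kHz mono 16-bit PCM WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```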

Health Check

  • Last commit: 18 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 21
  • Issues (30d): 30
  • Star History: 2,428 stars in the last 90 days
