parakeet.cpp by Frikallo

High-performance C++ speech AI inference engine

Created 1 month ago
257 stars

Top 98.2% on SourcePulse

View on GitHub
Project Summary

parakeet.cpp delivers an ultra-fast, portable C++ implementation for on-device speech recognition, leveraging NVIDIA's Parakeet models. It targets developers and power users seeking high-performance Automatic Speech Recognition (ASR) without the overhead of heavy runtimes like Python or ONNX. The project offers significant speedups, particularly on Apple Silicon GPUs, by utilizing its custom axiom tensor library for Metal acceleration.

How It Works

The project is built on a pure C++ architecture centered around axiom, a lightweight tensor library featuring automatic Metal GPU acceleration. It employs a shared FastConformer encoder and supports diverse decoders, including CTC, TDT, and RNNT, alongside specialized streaming models. This design bypasses traditional dependencies, enabling efficient on-device inference through optimized Metal GPU operations and FP16 support for reduced memory footprint and enhanced speed.
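The decoder families differ in how they turn per-frame model outputs into text. The simplest, CTC greedy decoding, takes the argmax token at each frame, collapses consecutive repeats, and drops the blank symbol. A minimal, self-contained sketch of that collapse rule (an illustration of the general CTC technique, not the project's actual decoder code):

```cpp
#include <vector>

// Greedy CTC decoding step: given the argmax token ID for each audio frame,
// collapse runs of repeated tokens and remove the blank symbol.
std::vector<int> ctc_collapse(const std::vector<int>& frame_argmax, int blank) {
    std::vector<int> out;
    int prev = blank;
    for (int tok : frame_argmax) {
        // Emit a token only when it is not blank and not a repeat of the
        // previous frame; a blank between repeats separates real duplicates.
        if (tok != blank && tok != prev) out.push_back(tok);
        prev = tok;
    }
    return out;
}
```

For example, the frame sequence {3, 3, 0, 3, 5, 5} with blank ID 0 collapses to {3, 3, 5}: the blank between the two runs of 3 keeps them as two distinct tokens. TDT and RNNT decoders replace this frame-independent rule with label-dependent emission, which is why they share the encoder but need different decoder code paths.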

Quick Start & Requirements

  • Primary install/run: Clone recursively (git clone --recursive), build with make build.
  • Prerequisites: C++20 (Clang 14+ or GCC 12+), CMake 3.20+, macOS 13+ for Metal GPU acceleration.
  • Dependencies: axiom (included), safetensors, torch (for weight conversion), dr_libs, stb_vorbis (audio handling).
  • Setup: Requires downloading models from HuggingFace (e.g., nvidia/parakeet-tdt_ctc-110m) and converting them using provided Python scripts (scripts/convert_nemo.py).
  • Links: HuggingFace models, conversion scripts.

Highlighted Details

  • Supports multiple decoders (CTC, TDT, RNNT) with beam search and optional ARPA LM fusion.
  • Provides per-word timestamps and confidence scores.
  • Features phrase boosting for domain-specific vocabulary via token-level tries.
  • Enables batch transcription for multiple audio files.
  • Integrates Silero VAD for preprocessing and speaker diarization via Sortformer.
  • Achieves significant GPU acceleration on Apple Silicon (up to 96x faster) using Metal via axiom.
  • Offers FP16 inference for approximately 2x memory reduction.
  • Includes streaming models (EOU, Nemotron) with configurable latency.
  • Exposes a flat C API for easy integration with other languages.
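The phrase-boosting feature mentioned above relies on a token-level trie: boosted phrases are stored as sequences of token IDs, and a decoding hypothesis that still lies on a trie path can receive a score bonus. A minimal sketch of such a structure (names and layout are illustrative assumptions, not the project's API):

```cpp
#include <map>
#include <memory>
#include <vector>

// Token-level trie for phrase boosting: phrases are inserted as token-ID
// sequences; during beam search, hypotheses are matched against trie paths.
struct TokenTrie {
    struct Node {
        std::map<int, std::unique_ptr<Node>> children;
        bool is_phrase_end = false;
    };
    Node root;

    // Register a boosted phrase as a sequence of token IDs.
    void insert(const std::vector<int>& tokens) {
        Node* n = &root;
        for (int t : tokens) {
            auto& child = n->children[t];
            if (!child) child = std::make_unique<Node>();
            n = child.get();
        }
        n->is_phrase_end = true;
    }

    // Walk the trie along `tokens`; nullptr means the sequence has left
    // every registered phrase and should stop receiving a boost.
    const Node* walk(const std::vector<int>& tokens) const {
        const Node* n = &root;
        for (int t : tokens) {
            auto it = n->children.find(t);
            if (it == n->children.end()) return nullptr;
            n = it->second.get();
        }
        return n;
    }

    // True if `tokens` is a prefix of at least one boosted phrase.
    bool is_prefix(const std::vector<int>& tokens) const {
        return walk(tokens) != nullptr;
    }

    // True if `tokens` exactly matches a registered phrase.
    bool is_phrase(const std::vector<int>& tokens) const {
        const Node* n = walk(tokens);
        return n != nullptr && n->is_phrase_end;
    }
};
```

Operating on token IDs rather than words keeps the boost compatible with subword vocabularies: a domain term that the tokenizer splits into several pieces is still matched piece by piece during decoding.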

Maintenance & Community

The README includes a detailed roadmap, suggesting active development. No community channels (e.g., Discord, Slack), notable contributors, or sponsorships are mentioned.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The permissive MIT license allows for broad compatibility, including commercial use and integration into closed-source projects.

Limitations & Caveats

GPU acceleration is currently limited to Apple Silicon hardware running macOS 13+. Offline models handle roughly 4-5 minutes of audio at most; longer recordings require the dedicated streaming models. Converting HuggingFace .nemo files to the project's .safetensors format is required before inference.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 14 stars in the last 30 days

Explore Similar Projects

pyctcdecode by kensho-technologies

  • CTC beam search decoder for speech recognition
  • 469 stars; created 4 years ago, updated 2 years ago
  • Starred by Omar Sanseviero (DevRel at Google DeepMind), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

voxtral.c by antirez

  • Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B
  • 2k stars; created 2 months ago, updated 1 month ago
  • Starred by Dan Guido (Cofounder of Trail of Bits), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 3 more.