parakeet.cpp by Frikallo

High-performance C++ speech AI inference engine

Created 1 month ago
257 stars

Top 98.2% on SourcePulse

View on GitHub
Project Summary

parakeet.cpp delivers an ultra-fast, portable C++ implementation for on-device speech recognition, leveraging NVIDIA's Parakeet models. It targets developers and power users seeking high-performance Automatic Speech Recognition (ASR) without the overhead of heavy runtimes like Python or ONNX. The project offers significant speedups, particularly on Apple Silicon GPUs, by utilizing its custom axiom tensor library for Metal acceleration.

How It Works

The project is built on a pure C++ architecture centered around axiom, a lightweight tensor library featuring automatic Metal GPU acceleration. It employs a shared FastConformer encoder and supports diverse decoders, including CTC, TDT, and RNNT, alongside specialized streaming models. This design bypasses traditional dependencies, enabling efficient on-device inference through optimized Metal GPU operations and FP16 support for reduced memory footprint and enhanced speed.
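The decoder families differ in how they turn per-frame model outputs into text. The simplest, CTC greedy decoding, takes the argmax token at each frame, collapses consecutive repeats, and drops the blank symbol. A minimal, self-contained sketch of that collapse rule (an illustration of the general CTC technique, not the project's actual decoder code):

```cpp
#include <vector>

// Greedy CTC decoding step: given the argmax token ID for each audio frame,
// collapse runs of repeated tokens and remove the blank symbol.
std::vector<int> ctc_collapse(const std::vector<int>& frame_argmax, int blank) {
    std::vector<int> out;
    int prev = blank;
    for (int tok : frame_argmax) {
        // Emit a token only when it is not blank and not a repeat of the
        // previous frame; a blank between repeats separates real duplicates.
        if (tok != blank && tok != prev) out.push_back(tok);
        prev = tok;
    }
    return out;
}
```

For example, the frame sequence {3, 3, 0, 3, 5, 5} with blank ID 0 collapses to {3, 3, 5}: the blank between the two runs of 3 keeps them as two distinct tokens. TDT and RNNT decoders replace this frame-independent rule with label-dependent emission, which is why they share the encoder but need different decoder code paths.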

Quick Start & Requirements

  • Primary install/run: Clone recursively (git clone --recursive), build with make build.
  • Prerequisites: C++20 (Clang 14+ or GCC 12+), CMake 3.20+, macOS 13+ for Metal GPU acceleration.
  • Dependencies: axiom (included), safetensors, torch (for weight conversion), dr_libs, stb_vorbis (audio handling).
  • Setup: Requires downloading models from HuggingFace (e.g., nvidia/parakeet-tdt_ctc-110m) and converting them using provided Python scripts (scripts/convert_nemo.py).
  • Links: HuggingFace models, conversion scripts.

Highlighted Details

  • Supports multiple decoders (CTC, TDT, RNNT) with beam search and optional ARPA LM fusion.
  • Provides per-word timestamps and confidence scores.
  • Features phrase boosting for domain-specific vocabulary via token-level tries.
  • Enables batch transcription for multiple audio files.
  • Integrates Silero VAD for preprocessing and speaker diarization via Sortformer.
  • Achieves significant GPU acceleration on Apple Silicon (up to 96x faster) using Metal via axiom.
  • Offers FP16 inference for approximately 2x memory reduction.
  • Includes streaming models (EOU, Nemotron) with configurable latency.
  • Exposes a flat C API for easy integration with other languages.
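The phrase-boosting feature mentioned above relies on a token-level trie: boosted phrases are stored as sequences of token IDs, and a decoding hypothesis that still lies on a trie path can receive a score bonus. A minimal sketch of such a structure (names and layout are illustrative assumptions, not the project's API):

```cpp
#include <map>
#include <memory>
#include <vector>

// Token-level trie for phrase boosting: phrases are inserted as token-ID
// sequences; during beam search, hypotheses are matched against trie paths.
struct TokenTrie {
    struct Node {
        std::map<int, std::unique_ptr<Node>> children;
        bool is_phrase_end = false;
    };
    Node root;

    // Register a boosted phrase as a sequence of token IDs.
    void insert(const std::vector<int>& tokens) {
        Node* n = &root;
        for (int t : tokens) {
            auto& child = n->children[t];
            if (!child) child = std::make_unique<Node>();
            n = child.get();
        }
        n->is_phrase_end = true;
    }

    // Walk the trie along `tokens`; nullptr means the sequence has left
    // every registered phrase and should stop receiving a boost.
    const Node* walk(const std::vector<int>& tokens) const {
        const Node* n = &root;
        for (int t : tokens) {
            auto it = n->children.find(t);
            if (it == n->children.end()) return nullptr;
            n = it->second.get();
        }
        return n;
    }

    // True if `tokens` is a prefix of at least one boosted phrase.
    bool is_prefix(const std::vector<int>& tokens) const {
        return walk(tokens) != nullptr;
    }

    // True if `tokens` exactly matches a registered phrase.
    bool is_phrase(const std::vector<int>& tokens) const {
        const Node* n = walk(tokens);
        return n != nullptr && n->is_phrase_end;
    }
};
```

Operating on token IDs rather than words keeps the boost compatible with subword vocabularies: a domain term that the tokenizer splits into several pieces is still matched piece by piece during decoding.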

Maintenance & Community

The README includes a detailed roadmap, suggesting active development. No community channels (e.g., Discord, Slack), notable contributors, or sponsorships are mentioned.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The permissive MIT license allows for broad compatibility, including commercial use and integration into closed-source projects.

Limitations & Caveats

GPU acceleration is currently limited to Apple Silicon hardware running macOS 13+. Offline models handle roughly 4-5 minutes of audio at most; longer recordings require the dedicated streaming models. Converting HuggingFace .nemo files to the project's .safetensors format is required before inference.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 14 stars in the last 30 days

Explore Similar Projects

pyctcdecode by kensho-technologies

  • CTC beam search decoder for speech recognition
  • 469 stars; created 4 years ago, updated 2 years ago
  • Starred by Omar Sanseviero (DevRel at Google DeepMind), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

voxtral.c by antirez

  • Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B
  • 2k stars; created 2 months ago, updated 1 month ago
  • Starred by Dan Guido (Cofounder of Trail of Bits), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 3 more.