mlx-audio  by Blaizzy

TTS/STT/STS library for efficient speech analysis on Apple Silicon

created 8 months ago
2,523 stars

Top 19.0% on sourcepulse

Project Summary

This library provides text-to-speech (TTS) and speech-to-speech (STS) capabilities leveraging Apple's MLX framework for efficient inference on Apple Silicon. It targets developers and users seeking fast, customizable speech synthesis with features like multiple voices, adjustable speed, and an interactive web interface with 3D audio visualization.

How It Works

MLX-Audio utilizes the MLX framework for accelerated computation on Apple Silicon, enabling fast inference for its TTS and STS models. It supports the Kokoro model architecture for multilingual TTS and the CSM model for voice cloning via reference audio. The library offers direct Python API access, a CLI for quick generation, and a FastAPI-based web server with a 3D audio visualizer.
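As an illustration of the CLI route mentioned above, the sketch below assembles a command line for a single TTS generation without running it. The entry-point name `mlx_audio.tts.generate`, the flags `--text`, `--voice`, and `--speed`, and the voice identifier `af_heart` are assumptions not confirmed by this summary, and `build_tts_command` is a hypothetical helper, so check the project README before relying on any of them.

```python
import shlex


def build_tts_command(text, voice="af_heart", speed=1.0):
    """Build an argv list for a TTS command-line call (hypothetical helper).

    Assumed, not confirmed by this summary: the entry point is named
    `mlx_audio.tts.generate` and accepts --text, --voice and --speed.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [
        "mlx_audio.tts.generate",  # assumed CLI entry point
        "--text", text,
        "--voice", voice,          # assumed voice identifier format
        "--speed", str(speed),
    ]


if __name__ == "__main__":
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(build_tts_command("Hello, world", speed=1.4)))
```

Returning a list rather than a string keeps the text argument intact when it is later passed to `subprocess.run`, since no shell quoting is involved.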

Quick Start & Requirements

  • Install: pip install mlx-audio
  • Web interface/API dependencies: pip install -r requirements.txt
  • Requirements: MLX, Python 3.8+, Apple Silicon Mac (recommended). Japanese support requires misaki[ja]; Mandarin Chinese support requires misaki[zh].
  • Quick Start: https://github.com/Blaizzy/mlx-audio
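The requirements listed above can be checked before installing. The following is a minimal sketch using only the standard library. The import names `mlx`, `mlx_audio`, and `misaki` are assumptions about how the packages are exposed, and `check_environment` is a hypothetical helper, not part of mlx-audio.

```python
import importlib.util
import platform
import sys


def check_environment():
    """Return a dict of booleans describing the local setup (hypothetical helper)."""
    return {
        # The summary lists Python 3.8+ as a requirement.
        "python_ok": sys.version_info >= (3, 8),
        # A native Python on an Apple Silicon Mac reports Darwin / arm64.
        "apple_silicon": platform.system() == "Darwin"
        and platform.machine() == "arm64",
        # Assumed import names; find_spec only looks the module up,
        # it does not import it.
        "mlx_installed": importlib.util.find_spec("mlx") is not None,
        "mlx_audio_installed": importlib.util.find_spec("mlx_audio") is not None,
        "misaki_installed": importlib.util.find_spec("misaki") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

Since Apple Silicon is listed as recommended rather than required, a negative `apple_silicon` result is a warning, not necessarily a blocker.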

Highlighted Details

  • Fast inference on Apple Silicon (M series chips).
  • Supports multiple voices (AF Heart, AF Nova, AF Bella, BF Emma) and languages (American English, British English, Japanese, Mandarin Chinese).
  • Interactive web interface with real-time 3D audio visualization and direct output folder access.
  • REST API for TTS generation and audio playback.
  • Voice cloning capability using the CSM model with reference audio.
  • Quantization support for optimized performance.
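To make the REST API item above more concrete, here is a hedged sketch of how a client might prepare a TTS request with the standard library. The `/tts` path, the form fields `text`, `voice`, and `speed`, the voice identifier `af_heart`, and the base URL `http://127.0.0.1:8000` are all assumptions for illustration; the summary does not specify the actual routes or parameters, and `build_tts_request` is a hypothetical helper.

```python
import urllib.parse
import urllib.request


def build_tts_request(text, voice="af_heart", speed=1.0,
                      base_url="http://127.0.0.1:8000"):
    """Prepare (but do not send) a form-encoded POST request.

    Assumed, not confirmed by this summary: the server exposes a /tts
    route that accepts `text`, `voice` and `speed` as form fields.
    """
    form = urllib.parse.urlencode(
        {"text": text, "voice": voice, "speed": speed}
    ).encode("utf-8")
    return urllib.request.Request(f"{base_url}/tts", data=form, method="POST")


if __name__ == "__main__":
    req = build_tts_request("Hello from the API", speed=1.2)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The request is only constructed here. Sending it with `urllib.request.urlopen(req)` makes sense only once a local server is running and the real route and field names have been confirmed against the README.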

Maintenance & Community

  • Project maintained by Blaizzy.
  • Uses Kokoro model architecture and Three.js for visualization.
  • No explicit community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatible with commercial use and closed-source linking due to the permissive MIT license.

Limitations & Caveats

The README notes that Japanese and Mandarin Chinese support each require an additional misaki package (misaki[ja] and misaki[zh], respectively). The "open output folder" feature of the web interface works only when the server is running locally.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 4
  • Star History: 1,833 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

0.4%
4k
Multimodal LLM for real-time voice interactions
created 1 year ago
updated 4 days ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss

0.6%
49k
Few-shot voice cloning and TTS web UI
created 1 year ago
updated 2 weeks ago