Speech model for Apple Silicon using MLX
This project provides an implementation of the Conversation Speech Model (CSM) for Apple Silicon, leveraging the MLX framework. It enables text-to-speech generation with conversational context and offers a command-line interface (CLI) for easy usage and fine-tuning.
How It Works
The implementation utilizes the MLX framework, optimized for Apple Silicon hardware. It supports loading pre-trained weights from Hugging Face and allows for quantization to improve inference speed, potentially enabling near real-time generation. The model can incorporate conversational context through previous audio segments and text transcripts to produce more natural-sounding speech.
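The summary above notes that quantization speeds up inference at a possible cost in audio quality. As a minimal, self-contained illustration of why that trade-off exists (this is not the project's actual code, which relies on MLX's built-in quantization), here is a round-trip symmetric 4-bit quantization sketch in plain Python:

```python
def quantize_4bit(weights):
    """Symmetrically quantize floats to 4-bit integers in [-8, 7]."""
    # One scale per group of weights; falls back to 1.0 for all-zero input.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the 4-bit integers back to approximate floats."""
    return [v * scale for v in q]

weights = [0.12, -0.8, 0.33, 0.05, -0.41]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
# The round trip is lossy: each restored value can be off by up to scale / 2,
# which is the kind of precision loss behind the audio-quality caveat below.
```

Real quantized inference stores the small integers and a scale per group, so the model reads far less memory per matrix multiply; the approximation error in the restored weights is what can degrade output quality.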
Quick Start & Requirements
Install the package with pip:

pip install git+https://github.com/senstella/csm-mlx --upgrade

or with uv:

uv add git+https://github.com/senstella/csm-mlx --upgrade

Additional dependencies include sentencepiece (which may fail with compiler errors on Python >= 3.13), audiofile, and audresample. Pre-trained weights are available on Hugging Face as senstella/csm-1b-mlx.

To install the CLI:

uv tool install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade

or

pipx install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade

Highlighted Details
Maintenance & Community
The project acknowledges contributions from Sesame, Moshi, torchtune, MLX, typer, audiofile, and audresample. There is no explicit mention of community channels like Discord or Slack.
Licensing & Compatibility
Limitations & Caveats
The project notes that quantization, while speeding up inference, may lead to a loss in audio quality. There is an open TODO item to implement watermarking and further optimize performance for real-time inference. A known issue exists with sentencepiece compilation on Python versions >= 3.13.
The repository was last updated 2 months ago and is marked as inactive.