csm-mlx by senstella

Speech model for Apple Silicon using MLX

Created 10 months ago

392 stars

Top 73.4% on SourcePulse

Project Summary

This project implements the Conversational Speech Model (CSM) for Apple Silicon on top of the MLX framework. It generates speech from text, optionally conditioned on conversational context, and ships a command-line interface (CLI) for generation and fine-tuning.

How It Works

The implementation runs on MLX, Apple's array framework optimized for Apple Silicon hardware. It loads pre-trained weights from Hugging Face and supports quantization to speed up inference, potentially enabling near real-time generation. To produce more natural-sounding speech, the model can condition on conversational context: previous audio segments paired with their text transcripts.
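One way to picture the conversational context is as an ordered list of prior turns, each pairing a speaker id, a transcript, and the audio for that turn. The sketch below is illustrative only: `Segment`, `build_context`, and the `max_turns` parameter are hypothetical names, not the csm-mlx API, and plain Python lists stand in for MLX arrays.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One prior conversational turn (hypothetical structure, not the csm-mlx API)."""
    speaker: int
    text: str
    # Mono audio samples; a plain list stands in for an MLX array here.
    audio: List[float] = field(default_factory=list)


def build_context(history: List[Segment], max_turns: int = 4) -> List[Segment]:
    """Keep only the most recent turns so the conditioning input stays bounded."""
    if max_turns <= 0:
        return []
    return history[-max_turns:]


history = [
    Segment(speaker=0, text="How are you?", audio=[0.0, 0.1, -0.1]),
    Segment(speaker=1, text="Doing well, thanks.", audio=[0.2, 0.0]),
    Segment(speaker=0, text="Glad to hear it.", audio=[0.05]),
]

context = build_context(history, max_turns=2)
print([segment.text for segment in context])
```

Capping the number of turns is an assumption made for the sketch; how much history the real model accepts is not stated in this summary.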

Quick Start & Requirements

Installation: pip install git+https://github.com/senstella/csm-mlx --upgrade or uv add git+https://github.com/senstella/csm-mlx --upgrade.
Python Version: Python < 3.13 is recommended; sentencepiece may fail to compile on 3.13 and later.
Dependencies: MLX, Hugging Face Hub, audiofile, audresample.
Weights: Download from Hugging Face Hub (senstella/csm-1b-mlx).
CLI Installation: uv tool install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade or pipx install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade.
Docs (MLX, using streams): https://ml-explore.github.io/mlx/build/html/usage/using_streams.html
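Because of the Python version note above, it can help to check the interpreter before installing. The following is a minimal sketch: `sentencepiece_build_risk` is a hypothetical helper written for this example, and the only fact it encodes is the 3.13 threshold stated in the requirements.

```python
import sys


def sentencepiece_build_risk(version_info=sys.version_info) -> bool:
    """Return True on Python 3.13 or newer, where sentencepiece may fail to compile."""
    return tuple(version_info[:2]) >= (3, 13)


# Install command copied from the requirements list above.
PIP_INSTALL = "pip install git+https://github.com/senstella/csm-mlx --upgrade"

if sentencepiece_build_risk():
    print("Warning: Python >= 3.13 detected; sentencepiece may fail to compile.")
print(PIP_INSTALL)
```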

Highlighted Details

Supports conversational context for more natural speech generation.
Offers quantization for performance gains on Apple Silicon.
Includes a CLI for basic TTS, context-based generation, and fine-tuning.
Provides streaming generation capabilities for chunked audio output.
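The chunked output mentioned in the last item can be pictured with a plain Python generator. This is a sketch of the consumer-side pattern only: `stream_chunks` and its `chunk_size` parameter are hypothetical and do not reflect how csm-mlx exposes streaming; a list of floats stands in for generated audio.

```python
from typing import Iterator, List, Sequence


def stream_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive fixed-size chunks; the final chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield list(samples[start:start + chunk_size])


# Stand-in for generated audio samples.
samples = [i / 10 for i in range(10)]

chunk_lengths = []
for chunk in stream_chunks(samples, chunk_size=4):
    # A real consumer would hand each chunk to an audio player or a socket here.
    chunk_lengths.append(len(chunk))

print(chunk_lengths)
```

Yielding chunks lets a consumer start playback before the full utterance is available, which is the usual motivation for streaming output.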

Maintenance & Community

The project acknowledges contributions from Sesame, Moshi, torchtune, MLX, typer, audiofile, and audresample. There is no explicit mention of community channels like Discord or Slack.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project notes that quantization speeds up inference but may reduce audio quality. Watermarking and further optimization for real-time inference remain open TODO items. A known issue exists with sentencepiece compilation on Python versions >= 3.13.
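The quantization trade-off can be illustrated with a toy round trip: store each group of weights with a limited number of integer levels, restore them, and measure the error. This is a simplified group-wise affine scheme written in pure Python for illustration. It is not the MLX or csm-mlx implementation, and weight reconstruction error is only one plausible contributor to audible quality loss. The function names, `bits`, and `group_size` are assumptions of the sketch.

```python
import math
from typing import List


def quantize_dequantize(weights: List[float], bits: int, group_size: int) -> List[float]:
    """Quantize each group to 2**bits levels between its min and max, then restore."""
    levels = (1 << bits) - 1
    restored: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Guard against a constant group, where the range is zero.
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((w - lo) / scale) for w in group]
        restored.extend(lo + code * scale for code in codes)
    return restored


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Synthetic "weights" in [-1, 1]; real model weights would come from a checkpoint.
weights = [math.sin(0.37 * i) for i in range(64)]

err4 = max_abs_error(weights, quantize_dequantize(weights, bits=4, group_size=32))
err8 = max_abs_error(weights, quantize_dequantize(weights, bits=8, group_size=32))
print(f"4-bit max error: {err4:.4f}")
print(f"8-bit max error: {err8:.4f}")
```

Fewer bits mean coarser steps between representable values, so the 4-bit round trip shows a larger worst-case error than the 8-bit one.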

