csm-mlx by senstella

Speech model for Apple Silicon using MLX

Created 6 months ago
377 stars

Top 75.4% on SourcePulse

Project Summary

This project implements the Conversational Speech Model (CSM) for Apple Silicon on top of the MLX framework. It generates speech from text, optionally conditioned on conversational context, and includes a command-line interface (CLI) for generation and fine-tuning.

How It Works

The implementation uses the MLX framework, which is optimized for Apple Silicon hardware. It loads pre-trained weights from Hugging Face and supports quantization to speed up inference, which may allow near real-time generation. To produce more natural-sounding speech, the model can take conversational context as input: previous audio segments together with their text transcripts.

Quick Start & Requirements

  • Installation: pip install git+https://github.com/senstella/csm-mlx --upgrade or uv add git+https://github.com/senstella/csm-mlx --upgrade.
  • Python Version: Python < 3.13 is recommended, because sentencepiece may fail to compile on newer versions.
  • Dependencies: MLX, Hugging Face Hub, audiofile, audresample.
  • Weights: Download from Hugging Face Hub (senstella/csm-1b-mlx).
  • CLI Installation: uv tool install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade or pipx install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade.
  • Docs (MLX streams): https://ml-explore.github.io/mlx/build/html/usage/using_streams.html
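
The Python version note above can be checked before installing. The following is a minimal sketch, not part of csm-mlx: the helper name `is_recommended_python` is hypothetical, and the 3.13 cutoff is taken only from the recommendation in this list.

```python
import sys

# Assumption: the cutoff mirrors the "Python < 3.13" recommendation above;
# it is not an official support matrix for csm-mlx.
MAX_EXCLUSIVE = (3, 13)


def is_recommended_python(version_info=None):
    """Return True when the (major, minor) version is older than 3.13."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor; patch level does not matter here.
    return tuple(version_info[:2]) < MAX_EXCLUSIVE


if not is_recommended_python():
    print("Python >= 3.13 detected; building sentencepiece may fail.")
```

Running this in the target environment before `pip install` or `uv add` gives an early warning instead of a failed build halfway through dependency resolution.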

Highlighted Details

  • Supports conversational context for more natural speech generation.
  • Offers quantization for performance gains on Apple Silicon.
  • Includes a CLI for basic TTS, context-based generation, and fine-tuning.
  • Provides streaming generation capabilities for chunked audio output.
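
To illustrate what chunked audio output means for a consumer, here is a rough sketch that writes audio to a WAV file chunk by chunk instead of waiting for the full clip. It does not call csm-mlx: `fake_stream` is a hypothetical stand-in for a streaming generator, and the 24 kHz sample rate and 80 ms chunk size are assumptions chosen for illustration.

```python
import io
import math
import struct
import wave
from typing import Iterable, Iterator, List

SAMPLE_RATE = 24_000  # assumed output rate in Hz; verify against the project


def fake_stream(total_ms: int, chunk_ms: int = 80) -> Iterator[List[float]]:
    """Hypothetical stand-in for a streaming generator: a 440 Hz tone in chunks."""
    total = SAMPLE_RATE * total_ms // 1000
    chunk = SAMPLE_RATE * chunk_ms // 1000
    for start in range(0, total, chunk):
        end = min(start + chunk, total)
        yield [0.2 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
               for n in range(start, end)]


def write_stream(chunks: Iterable[List[float]], target) -> int:
    """Append each chunk to a 16-bit mono WAV as it arrives; return frames written."""
    frames = 0
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            # Clamp to [-1, 1] and convert floats to signed 16-bit integers.
            ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in chunk]
            wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
            frames += len(ints)
    return frames


buffer = io.BytesIO()
written = write_stream(fake_stream(total_ms=200), buffer)
print(f"wrote {written} frames")
```

The same loop shape applies to playback: each chunk can be handed to an audio device as soon as it is produced, which is what makes streaming useful for latency-sensitive conversational use.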

Maintenance & Community

The project credits Sesame, Moshi, torchtune, MLX, typer, audiofile, and audresample in its acknowledgments. It does not mention any community channels such as Discord or Slack.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project notes that quantization, while speeding up inference, may lead to a loss in audio quality. There is an open TODO item to implement watermarking and further optimize performance for real-time inference. A known issue exists with sentencepiece compilation on Python versions >= 3.13.
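
The quality trade-off from quantization can be seen with a small worked example. This is a generic sketch of group-wise affine quantization in pure Python, assuming a scheme of this general kind; it is not the project's code, and the function names, bit widths, and group size are illustrative choices.

```python
import random
from typing import List


def quantize_roundtrip(weights: List[float], bits: int = 4,
                       group_size: int = 64) -> List[float]:
    """Quantize each group to 2**bits levels, then map back to floats."""
    levels = (1 << bits) - 1
    restored: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Avoid division by zero when every weight in the group is equal.
        scale = (hi - lo) / levels if hi > lo else 1.0
        for w in group:
            code = round((w - lo) / scale)  # integer code in [0, levels]
            restored.append(code * scale + lo)
    return restored


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
err4 = max_abs_error(weights, quantize_roundtrip(weights, bits=4))
err8 = max_abs_error(weights, quantize_roundtrip(weights, bits=8))
print(f"4-bit max error: {err4:.6f}")
print(f"8-bit max error: {err8:.6f}")
```

Fewer bits mean coarser steps between representable values, so the restored weights drift further from the originals. In a speech model, that drift is one plausible source of the audio quality loss noted above.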

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days
