csm-mlx  by senstella

Speech model for Apple Silicon using MLX

created 4 months ago
367 stars

Top 78.0% on sourcepulse

Project Summary

This project implements the Conversational Speech Model (CSM) for Apple Silicon on top of the MLX framework. It generates speech from text, optionally conditioned on conversational context, and ships a command-line interface (CLI) for generation and fine-tuning.

How It Works

The implementation is built on MLX, a framework optimized for Apple Silicon hardware. It loads pre-trained weights from Hugging Face and supports quantization, which speeds up inference and may allow near real-time generation. To produce more natural-sounding speech, the model can condition on conversational context: previous audio segments and their text transcripts.
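The trade-off behind quantization can be illustrated with a minimal sketch in plain Python. This is not the project's or MLX's quantization code: the function names `quantize_group` and `dequantize_group`, the sample weights, and the single-group layout are all assumptions made for illustration. It shows why storing weights as low-bit integer codes saves memory and bandwidth but introduces a bounded rounding error.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to `bits`-bit integer codes.

    Hypothetical helper for illustration; not an MLX or csm-mlx API.
    """
    levels = (1 << bits) - 1              # 15 steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


# Made-up weights standing in for one group of model parameters.
weights = [-0.82, -0.31, -0.05, 0.0, 0.12, 0.47, 0.9, 1.3]
codes, scale, lo = quantize_group(weights, bits=4)
restored = dequantize_group(codes, scale, lo)

# The round-trip error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"codes={codes}")
print(f"max round-trip error={max_error:.4f}, half step={scale / 2:.4f}")
```

Fewer bits mean a larger step size and therefore a larger worst-case error per weight, which is one plausible way to read the quality loss the project warns about.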

Quick Start & Requirements

  • Installation: pip install git+https://github.com/senstella/csm-mlx --upgrade or uv add git+https://github.com/senstella/csm-mlx --upgrade.
  • Python Version: Python < 3.13 recommended, because sentencepiece may fail to compile on newer versions.
  • Dependencies: MLX, Hugging Face Hub, audiofile, audresample.
  • Weights: Download from Hugging Face Hub (senstella/csm-1b-mlx).
  • CLI Installation: uv tool install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade or pipx install "git+https://github.com/senstella/csm-mlx[cli]" --upgrade.
  • Docs (MLX streams usage): https://ml-explore.github.io/mlx/build/html/usage/using_streams.html
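Before installing, it may help to confirm that the interpreter falls in the recommended range. The following is a minimal sketch using only the standard library; the helper name `python_is_recommended` and the warning text are hypothetical and not part of the project.

```python
import sys


def python_is_recommended(version_info=None, first_unsupported=(3, 13)):
    """Return True if the (major, minor) version is below `first_unsupported`.

    Hypothetical helper; the 3.13 threshold comes from the requirement above.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) < first_unsupported


if not python_is_recommended():
    print(
        "Warning: Python >= 3.13 detected; "
        "building sentencepiece may fail during installation."
    )
```

Running the check in the target virtual environment, rather than the system interpreter, gives the relevant answer when tools such as uv or pipx manage their own Python.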

Highlighted Details

  • Supports conversational context for more natural speech generation.
  • Offers quantization for performance gains on Apple Silicon.
  • Includes a CLI for basic TTS, context-based generation, and fine-tuning.
  • Provides streaming generation capabilities for chunked audio output.
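The idea of chunked audio output can be sketched with a plain Python generator. This does not use the project's streaming API, which is not described in this summary: `fake_frame_source` is a hypothetical stand-in for a model that decodes one audio frame at a time, and `stream_chunks` is an assumed helper that groups frames so a consumer can start playback before generation finishes.

```python
from typing import Iterable, Iterator, List


def fake_frame_source(n_frames: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for a model decoding one audio frame at a time."""
    for i in range(n_frames):
        yield [float(i)] * 4  # 4 dummy samples per frame


def stream_chunks(
    frames: Iterable[List[float]], frames_per_chunk: int
) -> Iterator[List[float]]:
    """Group incoming frames into chunks and yield each chunk as soon as it is full."""
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    buffer: List[float] = []
    count = 0
    for frame in frames:
        buffer.extend(frame)
        count += 1
        if count == frames_per_chunk:
            yield buffer
            buffer, count = [], 0
    if buffer:  # flush the final, possibly shorter chunk
        yield buffer


chunks = list(stream_chunks(fake_frame_source(5), frames_per_chunk=2))
for index, chunk in enumerate(chunks):
    # A real consumer would play or write the chunk here.
    print(f"chunk {index}: {len(chunk)} samples")
```

Smaller chunks lower the time to first audio at the cost of more frequent hand-offs to the consumer; the right size depends on the playback path.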

Maintenance & Community

The project acknowledges contributions from Sesame, Moshi, torchtune, MLX, typer, audiofile, and audresample. There is no explicit mention of community channels like Discord or Slack.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project notes that quantization speeds up inference but may reduce audio quality. Open TODO items include implementing watermarking and further optimizing performance for real-time inference. Compiling sentencepiece is a known problem on Python >= 3.13.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

41 stars in the last 90 days
