csm by SesameAILabs

Speech generation model for conversational AI research

Created 5 months ago · 13,847 stars · Top 3.7% on sourcepulse

View on GitHub
Project Summary

CSM (Conversational Speech Model) is a speech generation model that produces audio from text and audio inputs, targeting researchers and developers building speech applications. It leverages a Llama backbone and a specialized audio decoder to generate RVQ audio codes, enabling high-quality, context-aware speech synthesis.

How It Works

CSM employs a Llama-3.2-1B model as its backbone for processing text and conversational context, coupled with a smaller, dedicated audio decoder that produces the RVQ codes of the Mimi audio codec. This split keeps generation efficient while preserving fidelity, and providing conversational context leads to more natural and coherent audio outputs.
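To make the division of labor concrete, here is a toy-scale sketch of that two-stage idea: a "backbone" summarizes the text, a coarse head predicts the level-0 code per frame, and a much smaller "decoder" expands each frame into the remaining RVQ codebook levels. Every module, shape, and name here is illustrative only; this is not the CSM implementation, which uses a Llama-3.2-1B backbone and the Mimi codec.

    # Illustrative toy, not the CSM code: two-stage frame-by-frame RVQ generation.
    import torch
    import torch.nn as nn

    DIM, VOCAB, CODEBOOKS, CODEBOOK_SIZE = 64, 256, 8, 1024

    text_encoder = nn.EmbeddingBag(VOCAB, DIM)                   # stand-in for the Llama backbone
    coarse_head = nn.Linear(DIM, CODEBOOK_SIZE)                  # predicts RVQ level 0 for each frame
    code_embed = nn.Embedding(CODEBOOK_SIZE, DIM)                # embeds the level-0 code
    fine_decoder = nn.Linear(2 * DIM, (CODEBOOKS - 1) * CODEBOOK_SIZE)  # stand-in for the small audio decoder

    def generate_codes(token_ids: torch.Tensor, n_frames: int = 5) -> torch.Tensor:
        """Return a toy (n_frames, CODEBOOKS) grid of RVQ code indices."""
        hidden = text_encoder(token_ids.unsqueeze(0))            # (1, DIM) summary of the text/context
        frames = []
        for _ in range(n_frames):
            coarse = coarse_head(hidden).argmax(-1)              # level-0 code for this frame
            fine_in = torch.cat([hidden, code_embed(coarse)], dim=-1)
            fine = fine_decoder(fine_in).view(-1, CODEBOOKS - 1, CODEBOOK_SIZE).argmax(-1)
            frames.append(torch.cat([coarse.unsqueeze(-1), fine], dim=-1))
            # A real model would also feed the emitted frame back into the backbone.
        return torch.cat(frames, dim=0)                          # Mimi would decode such codes to audio

    print(generate_codes(torch.randint(0, VOCAB, (12,))).shape)  # torch.Size([5, 8])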

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires a CUDA-compatible GPU (tested with CUDA 12.4 and 12.6), Python 3.10+, and ffmpeg.
  • Access to Hugging Face models Llama-3.2-1B and CSM-1B is necessary.
  • Login to Hugging Face via huggingface-cli login.
  • Windows users should install triton-windows instead of triton.
  • Official quick-start script: python run_csm.py.
  • API usage examples are available in the README; a minimal generation sketch follows this list.
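For orientation, a basic generation call looks roughly like the sketch below. It follows the shape of the README's example, but the exact names (load_csm_1b, speaker, context, max_audio_length_ms, sample_rate) are assumptions drawn from that example and should be checked against the current repository.

    # Hedged sketch of basic generation; names mirror the README example and may change.
    import torch
    import torchaudio
    from generator import load_csm_1b  # module shipped in the cloned repo

    device = "cuda" if torch.cuda.is_available() else "cpu"
    generator = load_csm_1b(device=device)  # pulls CSM-1B weights from Hugging Face

    audio = generator.generate(
        text="Hello from Sesame.",
        speaker=0,                  # integer speaker id
        context=[],                 # no conversational context for this call
        max_audio_length_ms=10_000,
    )

    # The generator exposes the output sample rate for saving the waveform.
    torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)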

Highlighted Details

  • Generates RVQ audio codes from text and audio inputs.
  • Supports conversational context for improved audio quality (see the context-passing sketch after this list).
  • Fine-tuned variant powers an interactive voice demo.
  • Base model is capable of producing a variety of voices.
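Because output quality improves when the model sees the conversation so far, prior turns can be passed as context before generating the next utterance. The sketch below follows the README's context example; the Segment container and its fields (text, speaker, audio), like the file names, are assumptions that may differ from the current code.

    # Hedged sketch of context-conditioned generation; Segment and its fields
    # follow the README example and may differ in the current repository.
    import torchaudio
    from generator import load_csm_1b, Segment

    generator = load_csm_1b(device="cuda")

    def load_audio(path: str):
        # Resample a recorded prior utterance to the generator's sample rate.
        wav, sr = torchaudio.load(path)
        return torchaudio.functional.resample(
            wav.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
        )

    # Hypothetical prior turns of a two-speaker conversation.
    context = [
        Segment(text="How are you doing today?", speaker=0, audio=load_audio("utterance_0.wav")),
        Segment(text="Pretty good, thanks for asking.", speaker=1, audio=load_audio("utterance_1.wav")),
    ]

    # Conditioning on the conversation so far yields more natural, coherent audio.
    audio = generator.generate(
        text="Glad to hear it. What are you working on?",
        speaker=0,
        context=context,
        max_audio_length_ms=10_000,
    )
    torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)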

Maintenance & Community

  • Authors include Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
  • No community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The model is primarily for research and educational purposes and is not a general-purpose multimodal LLM; it cannot generate text. While it has some capacity for non-English languages due to data contamination, performance is not guaranteed. The README explicitly prohibits impersonation, misinformation, and illegal or harmful activities.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 946 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai
Multimodal LLM for real-time voice interactions
Top 0.4% on sourcepulse · 4k stars · created 1 year ago · updated 4 days ago
Starred by Dan Guido (Cofounder of Trail of Bits), Joe Walnes (Head of Experimental Projects at Stripe), and 1 more.

chatterbox by resemble-ai
Open-source TTS model
Top 1.6% on sourcepulse · 10k stars · created 3 months ago · updated 1 day ago