csm by SesameAILabs

Speech generation model for conversational AI research

Created 5 months ago · 13,847 stars · Top 3.7% on sourcepulse

View on GitHub
Project Summary

CSM (Conversational Speech Model) is a speech generation model that produces audio from text and audio inputs, targeting researchers and developers building speech applications. It leverages a Llama backbone and a specialized audio decoder to generate RVQ audio codes, enabling high-quality, context-aware speech synthesis.

How It Works

CSM employs a Llama-3.2-1B model as its backbone for processing text and conversational context, coupled with a smaller, dedicated audio decoder that produces the RVQ codes of the Mimi audio codec. This split keeps generation efficient while preserving fidelity, and providing conversational context leads to more natural and coherent audio outputs.
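To make the division of labor concrete, here is a toy-scale sketch of that two-stage idea: a "backbone" summarizes the text, a coarse head predicts the level-0 code per frame, and a much smaller "decoder" expands each frame into the remaining RVQ codebook levels. Every module, shape, and name here is illustrative only; this is not the CSM implementation, which uses a Llama-3.2-1B backbone and the Mimi codec.

    # Illustrative toy, not the CSM code: two-stage frame-by-frame RVQ generation.
    import torch
    import torch.nn as nn

    DIM, VOCAB, CODEBOOKS, CODEBOOK_SIZE = 64, 256, 8, 1024

    text_encoder = nn.EmbeddingBag(VOCAB, DIM)                   # stand-in for the Llama backbone
    coarse_head = nn.Linear(DIM, CODEBOOK_SIZE)                  # predicts RVQ level 0 for each frame
    code_embed = nn.Embedding(CODEBOOK_SIZE, DIM)                # embeds the level-0 code
    fine_decoder = nn.Linear(2 * DIM, (CODEBOOKS - 1) * CODEBOOK_SIZE)  # stand-in for the small audio decoder

    def generate_codes(token_ids: torch.Tensor, n_frames: int = 5) -> torch.Tensor:
        """Return a toy (n_frames, CODEBOOKS) grid of RVQ code indices."""
        hidden = text_encoder(token_ids.unsqueeze(0))            # (1, DIM) summary of the text/context
        frames = []
        for _ in range(n_frames):
            coarse = coarse_head(hidden).argmax(-1)              # level-0 code for this frame
            fine_in = torch.cat([hidden, code_embed(coarse)], dim=-1)
            fine = fine_decoder(fine_in).view(-1, CODEBOOKS - 1, CODEBOOK_SIZE).argmax(-1)
            frames.append(torch.cat([coarse.unsqueeze(-1), fine], dim=-1))
            # A real model would also feed the emitted frame back into the backbone.
        return torch.cat(frames, dim=0)                          # Mimi would decode such codes to audio

    print(generate_codes(torch.randint(0, VOCAB, (12,))).shape)  # torch.Size([5, 8])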

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires a CUDA-compatible GPU (tested with CUDA 12.4 and 12.6), Python 3.10+, and ffmpeg.
  • Access to Hugging Face models Llama-3.2-1B and CSM-1B is necessary.
  • Login to Hugging Face via huggingface-cli login.
  • Windows users should install triton-windows instead of triton.
  • Official quick-start script: python run_csm.py.
  • API usage examples are available in the README; a minimal generation sketch follows this list.
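For orientation, a basic generation call looks roughly like the sketch below. It follows the shape of the README's example, but the exact names (load_csm_1b, speaker, context, max_audio_length_ms, sample_rate) are assumptions drawn from that example and should be checked against the current repository.

    # Hedged sketch of basic generation; names mirror the README example and may change.
    import torch
    import torchaudio
    from generator import load_csm_1b  # module shipped in the cloned repo

    device = "cuda" if torch.cuda.is_available() else "cpu"
    generator = load_csm_1b(device=device)  # pulls CSM-1B weights from Hugging Face

    audio = generator.generate(
        text="Hello from Sesame.",
        speaker=0,                  # integer speaker id
        context=[],                 # no conversational context for this call
        max_audio_length_ms=10_000,
    )

    # The generator exposes the output sample rate for saving the waveform.
    torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)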

Highlighted Details

  • Generates RVQ audio codes from text and audio inputs.
  • Supports conversational context for improved audio quality (see the context-passing sketch after this list).
  • Fine-tuned variant powers an interactive voice demo.
  • Base model is capable of producing a variety of voices.
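Because output quality improves when the model sees the conversation so far, prior turns can be passed as context before generating the next utterance. The sketch below follows the README's context example; the Segment container and its fields (text, speaker, audio), like the file names, are assumptions that may differ from the current code.

    # Hedged sketch of context-conditioned generation; Segment and its fields
    # follow the README example and may differ in the current repository.
    import torchaudio
    from generator import load_csm_1b, Segment

    generator = load_csm_1b(device="cuda")

    def load_audio(path: str):
        # Resample a recorded prior utterance to the generator's sample rate.
        wav, sr = torchaudio.load(path)
        return torchaudio.functional.resample(
            wav.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
        )

    # Hypothetical prior turns of a two-speaker conversation.
    context = [
        Segment(text="How are you doing today?", speaker=0, audio=load_audio("utterance_0.wav")),
        Segment(text="Pretty good, thanks for asking.", speaker=1, audio=load_audio("utterance_1.wav")),
    ]

    # Conditioning on the conversation so far yields more natural, coherent audio.
    audio = generator.generate(
        text="Glad to hear it. What are you working on?",
        speaker=0,
        context=context,
        max_audio_length_ms=10_000,
    )
    torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)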

Maintenance & Community

  • Authors include Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
  • No community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The model is primarily for research and educational purposes and is not a general-purpose multimodal LLM; it cannot generate text. While it has some capacity for non-English languages due to data contamination, performance is not guaranteed. The README explicitly prohibits impersonation, misinformation, and illegal or harmful activities.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 946 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai
Multimodal LLM for real-time voice interactions
Top 0.4% on sourcepulse · 4k stars · created 1 year ago · updated 4 days ago
Starred by Dan Guido (Cofounder of Trail of Bits), Joe Walnes (Head of Experimental Projects at Stripe), and 1 more.

chatterbox by resemble-ai
Open-source TTS model
Top 1.6% on sourcepulse · 10k stars · created 3 months ago · updated 1 day ago