Speech generation model for conversational AI research
Top 3.7% on sourcepulse
CSM (Conversational Speech Model) is a speech generation model that produces audio from text and audio inputs, targeting researchers and developers building speech applications. It leverages a Llama backbone and a specialized audio decoder to generate RVQ audio codes, enabling high-quality, context-aware speech synthesis.
How It Works
CSM employs a Llama-3.2-1B model as its backbone for processing text and context, coupled with a smaller, dedicated audio decoder that generates Mimi audio codes. This architecture allows for efficient and high-fidelity speech generation, particularly when provided with conversational context, leading to more natural and coherent audio outputs.
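The RVQ (residual vector quantization) codes mentioned above can be illustrated with a toy sketch: each stage owns a codebook and quantizes the residual left over by the previous stage, so stacking stages refines the reconstruction. All sizes and codebooks here are illustrative, not the real Mimi configuration.

```python
import math
import random

random.seed(0)

# Toy residual vector quantization (RVQ). Sizes are illustrative only,
# not the real Mimi codec configuration.
NUM_STAGES = 4      # number of codebooks (quantization stages)
CODEBOOK_SIZE = 16  # entries per codebook
DIM = 8             # dimension of one audio-frame embedding

def rand_vec():
    return [random.gauss(0, 1) for _ in range(DIM)]

codebooks = [[rand_vec() for _ in range(CODEBOOK_SIZE)]
             for _ in range(NUM_STAGES)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rvq_encode(frame):
    """Each stage quantizes the residual left over by the previous stage."""
    residual = list(frame)
    codes = []
    for cb in codebooks:
        idx = min(range(CODEBOOK_SIZE), key=lambda i: dist(residual, cb[i]))
        codes.append(idx)
        residual = [r - e for r, e in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes):
    """Sum the chosen entry from every codebook to reconstruct the frame."""
    out = [0.0] * DIM
    for cb, i in zip(codebooks, codes):
        out = [o + e for o, e in zip(out, cb[i])]
    return out

frame = rand_vec()
codes = rvq_encode(frame)
print(codes, dist(frame, rvq_decode(codes)))
```

A frame is thus represented by one small integer per stage rather than raw samples, which is what lets the Llama backbone and decoder treat audio as discrete tokens.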
Quick Start & Requirements
Clone the repository, then install dependencies with pip install -r requirements.txt. ffmpeg is required for audio operations. Access to the Llama-3.2-1B and CSM-1B checkpoints on Hugging Face is necessary; authenticate with huggingface-cli login. On Windows, install triton-windows instead of triton. Run the example script with python run_csm.py.
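The setup steps above can be collected into one command sequence. This is a sketch: the repository URL is an assumption, so substitute wherever you obtained CSM from.

```shell
# Clone and enter the repository (URL assumed; use the official repo location).
git clone https://github.com/SesameAILabs/csm.git
cd csm

# Install Python dependencies.
pip install -r requirements.txt

# Windows only: swap in the Windows Triton build.
# pip uninstall -y triton && pip install triton-windows

# Authenticate so the gated Llama-3.2-1B and CSM-1B checkpoints can be fetched.
huggingface-cli login

# Generate example audio.
python run_csm.py
```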
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model is primarily for research and educational purposes and is not a general-purpose multimodal LLM; it cannot generate text. While it has some capacity for non-English languages due to data contamination, performance is not guaranteed. The README explicitly prohibits impersonation, misinformation, and illegal or harmful activities.