csm by SesameAILabs

Speech generation model for conversational AI research

Created 6 months ago
14,065 stars

Top 3.5% on SourcePulse

View on GitHub
Project Summary

CSM (Conversational Speech Model) is a speech generation model that produces audio from text and audio inputs, targeting researchers and developers building speech applications. It leverages a Llama backbone and a specialized audio decoder to generate RVQ audio codes, enabling high-quality, context-aware speech synthesis.

How It Works

CSM employs a Llama-3.2-1B model as its backbone for processing text and context, coupled with a smaller, dedicated audio decoder that generates Mimi audio codes. This architecture allows for efficient and high-fidelity speech generation, particularly when provided with conversational context, leading to more natural and coherent audio outputs.
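
The sketch below is a conceptual illustration of that two-stage layout, not code from the repository: a larger transformer consumes interleaved text/context tokens, and a smaller decoder predicts one set of RVQ codebook indices per audio frame. The module sizes, codebook count, and vocabulary size are illustrative assumptions, and toy Transformer encoder blocks stand in for the actual Llama-3.2-1B backbone and Mimi decoder.

```python
# Conceptual sketch only (not the repository's code): a large backbone over
# text/context tokens feeds a smaller decoder that predicts RVQ codebook
# indices per audio frame. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32    # assumption: RVQ codebooks per Mimi frame
CODEBOOK_SIZE = 1024  # assumption: entries per codebook
D_BACKBONE = 512      # toy size; the real backbone is Llama-3.2-1B
D_DECODER = 256       # toy size; the real decoder is a smaller transformer

class ToyCSM(nn.Module):
    """Toy two-stage model: backbone over text/context, decoder over RVQ codes."""

    def __init__(self, vocab_size: int = 32_000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D_BACKBONE)
        # Stage 1: the backbone processes interleaved text and context tokens.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_BACKBONE, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Stage 2: a smaller decoder predicts codebook indices per audio frame.
        self.to_decoder = nn.Linear(D_BACKBONE, D_DECODER)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_DECODER, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.code_heads = nn.ModuleList(
            nn.Linear(D_DECODER, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.backbone(self.embed(token_ids))   # (batch, frames, D_BACKBONE)
        d = self.decoder(self.to_decoder(h))       # (batch, frames, D_DECODER)
        # Logits for every codebook at every frame:
        # (batch, frames, NUM_CODEBOOKS, CODEBOOK_SIZE)
        return torch.stack([head(d) for head in self.code_heads], dim=2)

logits = ToyCSM()(torch.randint(0, 32_000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32, 1024])
```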

Quick Start & Requirements

  • Install dependencies with pip install -r requirements.txt after cloning the repository.
  • Requires a CUDA-compatible GPU (tested on CUDA 12.4 and 12.6), Python 3.10+, and ffmpeg.
  • Access to the Hugging Face models Llama-3.2-1B and CSM-1B is required.
  • Log in to Hugging Face via huggingface-cli login.
  • Windows users should install triton-windows instead of triton.
  • Official quick-start script: python run_csm.py.
  • API usage examples are available in the README; a basic generation sketch follows this list.
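
The sketch below follows the single-utterance example in the README; the load_csm_1b helper, the generate(...) keyword arguments, and the sample_rate attribute are taken from that example and may differ between repository revisions, so treat it as illustrative rather than authoritative.

```python
# Minimal generation sketch, following the README's API example.
# Assumes the cloned repository is on PYTHONPATH and Hugging Face access is set up.
import torch
import torchaudio
from generator import load_csm_1b  # helper shipped in the repository

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)

# Generate a short utterance with no conversational context.
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# `audio` is a 1-D waveform tensor at the generator's native sample rate.
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```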

Highlighted Details

  • Generates RVQ audio codes from text and audio inputs.
  • Supports conversational context for improved audio quality (see the context sketch after this list).
  • Fine-tuned variant powers an interactive voice demo.
  • Base model is capable of producing a variety of voices.
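
Below is a sketch of context-conditioned generation, assuming the Segment helper and field names used in the README's context example; the .wav file names are placeholders for your own reference audio.

```python
# Sketch of context-conditioned generation, assuming the README's Segment API.
import torchaudio
from generator import load_csm_1b, Segment  # assumed to match the README example

generator = load_csm_1b(device="cuda")

def load_reference(path: str):
    # Load a prior utterance and resample it to the generator's sample rate.
    waveform, sample_rate = torchaudio.load(path)
    return torchaudio.functional.resample(
        waveform.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )

# Earlier conversation turns (text + audio + speaker id) condition the output.
context = [
    Segment(text="Hey, how are you doing?", speaker=0,
            audio=load_reference("utterance_0.wav")),
    Segment(text="Pretty good, thanks. And you?", speaker=1,
            audio=load_reference("utterance_1.wav")),
]

audio = generator.generate(
    text="I'm doing great, thanks for asking.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```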

Maintenance & Community

  • Authors include Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
  • No community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The model is primarily for research and educational purposes and is not a general-purpose multimodal LLM; it cannot generate text. While it has some capacity for non-English languages due to data contamination, performance is not guaranteed. The README explicitly prohibits impersonation, misinformation, and illegal or harmful activities.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 152 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 2 more.

AudioGPT by AIGC-Audio

  • 10k stars
  • Audio processing and generation research project
  • Created 2 years ago, updated 1 year ago