Transformer-based audio codec for low-bitrate, high-quality audio coding
This repository provides Transformer-based audio codecs for high-quality audio at low bitrates, targeting researchers and developers in audio processing and speech coding. It offers state-of-the-art performance for applications like speech synthesis and efficient audio transmission.
How It Works
The Stable Codec family uses Transformer architectures with sliding-window attention for efficient audio encoding and decoding. It employs a Finite Scalar Quantization (FSQ) bottleneck, which can be reconfigured post-hoc to shrink the token dictionary, making the resulting tokens practical to model with large language models. This approach balances reconstruction quality against compression efficiency.
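To make the bottleneck concrete, here is a minimal, self-contained sketch of the FSQ idea (an illustration, not the repository's implementation): each latent dimension is bounded and rounded onto a small grid of levels, and the grid cell occupied by a frame serves as its discrete token.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int) -> torch.Tensor:
    """Finite Scalar Quantization sketch: bound each latent dimension,
    then snap it to `levels` evenly spaced values."""
    z = torch.tanh(z)                  # bound latents to (-1, 1)
    half = (levels - 1) / 2
    q = torch.round(z * half) / half   # snap to the quantization grid
    return z + (q - z).detach()        # straight-through gradient estimator

z = torch.randn(1, 6, 100)             # (batch, latent dims, frames)
codes = fsq_quantize(z, levels=5)      # 5**6 = 15625 possible codes per frame
```

Because the codebook is an implicit grid rather than a set of learned vectors, the effective dictionary size can be changed after training by regrouping dimensions or levels, which is what enables the post-hoc reconfiguration described above.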
Quick Start & Requirements
pip install stable-codec

flash-attn is also required for inference; CPU inference is not supported. Pretrained weights are published as stabilityai/stable-codec-speech-16k.
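Once installed, usage follows the pattern below. This is a sketch based on the stable-codec README: the StableCodec constructor's pretrained_model argument and the encode/decode methods reflect that README, but exact signatures may change between releases.

```python
import torch
import torchaudio
from stable_codec import StableCodec

# Load pretrained weights from Hugging Face; a CUDA device is
# required because inference depends on FlashAttention.
model = StableCodec(
    pretrained_model="stabilityai/stable-codec-speech-16k",
    device=torch.device("cuda"),
)

# Encode a 16 kHz speech file into latents and discrete tokens,
# then reconstruct the waveform from the tokens alone.
latents, tokens = model.encode("input.wav")
decoded = model.decode(tokens)

torchaudio.save("decoded.wav", decoded.squeeze(0).cpu(), model.sample_rate)
```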
Highlighted Details
Two model variants are provided: stable-codec-speech-16k (fine-tuned for downstream tasks) and stable-codec-speech-16k-base (for reproducibility).

Maintenance & Community
The project builds on stable-audio-tools; see its documentation for details.

Licensing & Compatibility
Limitations & Caveats
The model has a hard requirement on FlashAttention, which rules out CPU inference and restricts deployment to supported GPU hardware. The stable-codec-speech-16k variant shows slightly lower objective reconstruction metrics than the base model.