Transformer-based audio codec for low-bitrate, high-quality audio coding
This repository provides Transformer-based audio codecs for high-quality audio at low bitrates, targeting researchers and developers in audio processing and speech coding. It offers state-of-the-art performance for applications like speech synthesis and efficient audio transmission.
How It Works
The Stable Codec family uses Transformer architectures with sliding-window attention for efficient audio encoding and decoding. It employs a Finite Scalar Quantization (FSQ) bottleneck, which can be reconfigured post-hoc to shrink the token dictionary, making the resulting tokens practical to model with large language models. This approach balances reconstruction quality against compression efficiency.
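To make the bottleneck concrete, here is a minimal, self-contained sketch of the FSQ idea (an illustration, not the repository's implementation): each latent dimension is bounded and rounded onto a small grid of levels, and the grid cell occupied by a frame serves as its discrete token.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int) -> torch.Tensor:
    """Finite Scalar Quantization sketch: bound each latent dimension,
    then snap it to `levels` evenly spaced values."""
    z = torch.tanh(z)                  # bound latents to (-1, 1)
    half = (levels - 1) / 2
    q = torch.round(z * half) / half   # snap to the quantization grid
    return z + (q - z).detach()        # straight-through gradient estimator

z = torch.randn(1, 6, 100)             # (batch, latent dims, frames)
codes = fsq_quantize(z, levels=5)      # 5**6 = 15625 possible codes per frame
```

Because the codebook is an implicit grid rather than a set of learned vectors, the effective dictionary size can be changed after training by regrouping dimensions or levels, which is what enables the post-hoc reconfiguration described above.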
Quick Start & Requirements
pip install stable-codec

flash-attn is also required for inference; CPU inference is not supported. Pretrained weights are published as stabilityai/stable-codec-speech-16k.
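Once installed, usage follows the pattern below. This is a sketch based on the stable-codec README: the StableCodec constructor's pretrained_model argument and the encode/decode methods reflect that README, but exact signatures may change between releases.

```python
import torch
import torchaudio
from stable_codec import StableCodec

# Load pretrained weights from Hugging Face; a CUDA device is
# required because inference depends on FlashAttention.
model = StableCodec(
    pretrained_model="stabilityai/stable-codec-speech-16k",
    device=torch.device("cuda"),
)

# Encode a 16 kHz speech file into latents and discrete tokens,
# then reconstruct the waveform from the tokens alone.
latents, tokens = model.encode("input.wav")
decoded = model.decode(tokens)

torchaudio.save("decoded.wav", decoded.squeeze(0).cpu(), model.sample_rate)
```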
Highlighted Details
Two model variants are provided: stable-codec-speech-16k (fine-tuned for downstream tasks) and stable-codec-speech-16k-base (for reproducibility).

Maintenance & Community
The project builds on stable-audio-tools; see its documentation for details.

Licensing & Compatibility
Limitations & Caveats
The model has a hard requirement on FlashAttention, which rules out CPU inference and restricts deployment to supported GPU hardware. The stable-codec-speech-16k variant shows slightly lower objective reconstruction metrics than the base model.