Expressive text-to-audio generation model
Higgs Audio v2 is a text-to-audio foundation model designed for expressive, multi-speaker audio generation. It targets researchers and developers who need advanced TTS capabilities, pairing state-of-the-art results on various benchmarks with novel features such as melodic humming and simultaneous speech and music generation.
How It Works
Higgs Audio v2 leverages a "generation variant" architecture trained on over 10 million hours of audio data. Key innovations include an automated annotation pipeline for its AudioVerse dataset, a unified audio tokenizer capturing semantic and acoustic features, and the DualFFN architecture to enhance acoustic modeling with minimal overhead. This approach enables sophisticated prosody adaptation, multi-speaker dialogue, and zero-shot voice cloning.
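The summary above names DualFFN but does not describe its mechanism. The sketch below illustrates one plausible reading, assuming DualFFN means that text tokens and audio tokens share the rest of the layer but are routed to separate feed-forward networks. The names `make_ffn` and `dual_ffn`, the layer sizes, and the random weights are all hypothetical and are not taken from the project's code.

```python
import random


def make_ffn(d_model, d_hidden, seed):
    """Build a tiny two-layer ReLU feed-forward function with fixed random weights.

    Hypothetical helper for illustration only; real models use learned weights.
    """
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]

    def ffn(x):
        # First projection followed by ReLU.
        hidden = [
            max(0.0, sum(x[i] * w1[i][j] for i in range(d_model)))
            for j in range(d_hidden)
        ]
        # Second projection back to the model dimension.
        return [
            sum(hidden[j] * w2[j][k] for j in range(d_hidden))
            for k in range(d_model)
        ]

    return ffn


def dual_ffn(hidden_states, is_audio, text_ffn, audio_ffn):
    """Route each token through exactly one FFN, chosen by its modality flag.

    Assumption: audio tokens use a dedicated FFN, text tokens use the other.
    """
    if len(hidden_states) != len(is_audio):
        raise ValueError("hidden_states and is_audio must have the same length")
    return [
        audio_ffn(h) if audio else text_ffn(h)
        for h, audio in zip(hidden_states, is_audio)
    ]


# Usage: one text token followed by two audio tokens, model width 4.
text_ffn = make_ffn(d_model=4, d_hidden=8, seed=0)
audio_ffn = make_ffn(d_model=4, d_hidden=8, seed=1)
tokens = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, -0.1, 0.0, 0.2],
    [1.0, 0.0, -1.0, 0.5],
]
is_audio = [False, True, True]
out = dual_ffn(tokens, is_audio, text_ffn, audio_ffn)
```

Under this reading, each token still passes through only one feed-forward network, so per-token compute stays close to that of a single-FFN layer while the audio path gets its own parameters. That would be consistent with the "minimal overhead" claim, though the summary does not confirm it.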
Quick Start & Requirements
An NVIDIA PyTorch container image (nvcr.io/nvidia/pytorch:25.02-py3) is referenced for the environment. Direct installation runs git clone, pip install -r requirements.txt, and pip install -e . in that order. Alternative setups using venv, conda, uv, and vLLM are also provided.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily xcodec, with its own LICENSE file.
Limitations & Caveats