higgs-audio by boson-ai

Expressive text-to-audio generation model

Created 1 year ago

8,291 stars

Top 6.2% on SourcePulse

View on GitHub

3 Experts Love This Project

Jiaming Song

Chief Scientist at Luma AI

Alex Chen

Cofounder of Nexa AI

Yaowei Zheng

Author of LLaMA-Factory

Project Summary

Higgs Audio v2 is a text-to-audio foundation model designed for expressive and multi-speaker audio generation. It targets researchers and developers seeking advanced TTS capabilities, offering state-of-the-art performance on various benchmarks and novel features like melodic humming and simultaneous speech/music generation.

How It Works

Higgs Audio v2 leverages a "generation variant" architecture trained on over 10 million hours of audio data. Key innovations include an automated annotation pipeline for its AudioVerse dataset, a unified audio tokenizer capturing semantic and acoustic features, and the DualFFN architecture to enhance acoustic modeling with minimal overhead. This approach enables sophisticated prosody adaptation, multi-speaker dialogue, and zero-shot voice cloning.

Quick Start & Requirements

Installation: Recommended via NVIDIA Deep Learning Containers (e.g., nvcr.io/nvidia/pytorch:25.02-py3). Direct installation involves git clone, pip install -r requirements.txt, and pip install -e .. venv, conda, uv, and vLLM options are also provided.
Prerequisites: NVIDIA GPU with at least 24GB memory recommended for optimal performance.
Links: Demo Video, Multilingual Demo, Tokenizer Blog, Architecture Blog.

Highlighted Details

Achieves 75.7% and 55.7% win rates over "gpt-4o-mini-tts" on "Emotions" and "Questions" categories in EmergentTTS-Eval.
Supports zero-shot voice cloning, multi-speaker dialogues, melodic humming, and speech with background music generation.
Evaluated on Seed-TTS Eval, ESD, EmergentTTS-Eval, and a custom Multi-speaker Eval benchmark.
Offers an OpenAI-compatible API server backed by the vLLM engine for higher throughput.

Maintenance & Community

Developed by Boson AI.
Release blog available at https://www.boson.ai/blog/higgs-audio-v2.

Licensing & Compatibility

The primary license is not explicitly stated in the README.
The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily xcodec, with its own LICENSE file.

Limitations & Caveats