higgs-audio  by boson-ai

Expressive text-to-audio generation model

created 3 weeks ago

6,775 stars

Top 7.5% on SourcePulse

Project Summary

Higgs Audio v2 is a text-to-audio foundation model designed for expressive and multi-speaker audio generation. It targets researchers and developers seeking advanced TTS capabilities, offering state-of-the-art performance on various benchmarks and novel features like melodic humming and simultaneous speech/music generation.

How It Works

Higgs Audio v2 uses what the project calls its "generation variant" architecture and is trained on over 10 million hours of audio data. Key innovations include an automated annotation pipeline for its AudioVerse dataset, a unified audio tokenizer that captures both semantic and acoustic features, and the DualFFN architecture, which enhances acoustic modeling with minimal overhead. Together these enable prosody adaptation, multi-speaker dialogue, and zero-shot voice cloning.

Quick Start & Requirements

  • Installation: Recommended via NVIDIA Deep Learning Containers (e.g., nvcr.io/nvidia/pytorch:25.02-py3). For a direct installation, git clone the repository, then run pip install -r requirements.txt followed by pip install -e . (the trailing dot is part of the command). Setup options using venv, conda, uv, and vLLM are also provided.
  • Prerequisites: NVIDIA GPU with at least 24GB memory recommended for optimal performance.
  • Links: Demo Video, Multilingual Demo, Tokenizer Blog, Architecture Blog.
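The 24GB recommendation above can be checked before installing anything heavy. The following is a minimal sketch, not part of the project: the helper names are hypothetical, "24GB" is read as decimal gigabytes (an assumption, chosen because cards sold as 24 GB typically report slightly less than 24 GiB), and the GPU query only runs if PyTorch with CUDA support happens to be installed.

```python
from typing import Optional

# Assumption: the "24GB" recommendation is interpreted as decimal gigabytes.
RECOMMENDED_BYTES = 24 * 10**9


def meets_recommendation(total_bytes: int, required: int = RECOMMENDED_BYTES) -> bool:
    """Return True if a GPU's total memory reaches the recommended size."""
    return total_bytes >= required


def first_gpu_memory_bytes() -> Optional[int]:
    """Total memory of CUDA device 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # optional dependency; only needed for the live query
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


if __name__ == "__main__":
    total = first_gpu_memory_bytes()
    if total is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    else:
        verdict = "meets" if meets_recommendation(total) else "is below"
        print(f"GPU 0 has {total / 10**9:.1f} GB, which {verdict} the 24 GB recommendation.")
```

Because the figure is a recommendation for optimal performance rather than a stated hard requirement, a result below the threshold suggests caution, not that the model cannot run.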

Highlighted Details

  • Achieves win rates of 75.7% and 55.7% over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories of EmergentTTS-Eval, respectively.
  • Supports zero-shot voice cloning, multi-speaker dialogues, melodic humming, and speech with background music generation.
  • Evaluated on Seed-TTS Eval, ESD, EmergentTTS-Eval, and a custom Multi-speaker Eval benchmark.
  • Offers an OpenAI-compatible API server backed by the vLLM engine for higher throughput.
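To illustrate what "OpenAI-compatible" could mean for a client, here is a hedged sketch that builds (but does not send) a text-to-speech request using only the Python standard library. The route /v1/audio/speech and the fields model, input, voice, and response_format follow OpenAI's speech API; whether this project's server exposes exactly that route is an assumption, and the base URL, port, model name, and voice name below are hypothetical placeholders.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    base_url: str = "http://localhost:8000",  # hypothetical server address
    model: str = "higgs-audio-v2",            # hypothetical model identifier
    voice: str = "default",                   # hypothetical voice name
) -> urllib.request.Request:
    """Build a POST request shaped like OpenAI's speech endpoint.

    Nothing is sent here; the caller decides when to open the request.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello from a client sketch.")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With a server actually running, passing the request to urllib.request.urlopen and writing the response bytes to a file would be the natural next step; the project's own documentation is the authority on the real route and accepted parameters.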

Licensing & Compatibility

  • The primary license is not explicitly stated in the README.
  • The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily xcodec, with its own LICENSE file.

Limitations & Caveats

  • Optimal performance requires an NVIDIA GPU with at least 24GB memory.
  • The primary license is not clearly specified, which may impact commercial use.
Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 28
  • Issues (30d): 92
  • Star History: 6,788 stars in the last 27 days
