higgs-audio by boson-ai

Expressive text-to-audio generation model

Created 2 months ago
7,446 stars

Top 6.9% on SourcePulse

Project Summary

Higgs Audio v2 is a text-to-audio foundation model designed for expressive and multi-speaker audio generation. It targets researchers and developers seeking advanced TTS capabilities, offering state-of-the-art performance on various benchmarks and novel features like melodic humming and simultaneous speech/music generation.

How It Works

Higgs Audio v2 uses what the project calls a "generation variant" architecture and is trained on over 10 million hours of audio data. Three components stand out: an automated annotation pipeline that builds the AudioVerse dataset, a unified audio tokenizer that captures both semantic and acoustic features, and the DualFFN architecture, which improves acoustic modeling with minimal overhead. Together these enable prosody adaptation, multi-speaker dialogue, and zero-shot voice cloning.

Quick Start & Requirements

  • Installation: Recommended via NVIDIA Deep Learning Containers (e.g., nvcr.io/nvidia/pytorch:25.02-py3). Direct installation involves three steps: git clone the repository, run pip install -r requirements.txt, then run pip install -e . from the repository root. Setup instructions for venv, conda, uv, and vLLM are also provided.
  • Prerequisites: NVIDIA GPU with at least 24GB memory recommended for optimal performance.
  • Links: Demo Video, Multilingual Demo, Tokenizer Blog, Architecture Blog.
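
The 24GB recommendation above can be checked before installing. The following is a minimal sketch, not part of the project: it assumes nvidia-smi is on the PATH, and it treats "24GB" as roughly 24,000 MiB (an assumption, chosen because nominal 24GB cards often report slightly less than 24,576 MiB). The function names are hypothetical.

```python
import shutil
import subprocess

# Assumption: "24GB" is read as roughly 24,000 MiB to leave some slack.
RECOMMENDED_MIB = 24_000


def parse_memory_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per line; skip blank or non-numeric lines such as '[N/A]'."""
    values = []
    for line in smi_output.splitlines():
        token = line.strip()
        if token.isdigit():
            values.append(int(token))
    return values


def gpus_meeting_recommendation(memory_mib: list[int],
                                threshold: int = RECOMMENDED_MIB) -> list[int]:
    """Return the positions of GPUs whose total memory is at or above the threshold."""
    return [i for i, mib in enumerate(memory_mib) if mib >= threshold]


def query_gpu_memory() -> list[int]:
    """Ask nvidia-smi for total memory per GPU; return [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_memory_mib(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    memory = query_gpu_memory()
    suitable = gpus_meeting_recommendation(memory)
    print(f"GPU memory (MiB): {memory}; GPUs meeting the recommendation: {suitable}")
```

An empty result means either no NVIDIA GPU was detected or none reaches the threshold; the model may still run on smaller cards, since the summary states a recommendation rather than a hard minimum.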

Highlighted Details

  • Achieves 75.7% and 55.7% win rates over "gpt-4o-mini-tts" on "Emotions" and "Questions" categories in EmergentTTS-Eval.
  • Supports zero-shot voice cloning, multi-speaker dialogues, melodic humming, and speech with background music generation.
  • Evaluated on Seed-TTS Eval, ESD, EmergentTTS-Eval, and a custom Multi-speaker Eval benchmark.
  • Offers an OpenAI-compatible API server backed by the vLLM engine for higher throughput.
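
Because the server is described as OpenAI-compatible, a client can in principle talk to it with a plain HTTP request. The sketch below uses only the Python standard library and follows the shape of the OpenAI speech endpoint (POST /v1/audio/speech with model, input, voice, and response_format fields). Whether this server exposes that exact route is an assumption, as are the base URL, the model name, and the voice name; check the project's own vLLM instructions for the real values.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server runs locally on port 8000
MODEL_NAME = "higgs-audio-v2"       # hypothetical model identifier
VOICE = "default"                   # hypothetical voice name


def build_speech_request(text: str,
                         base_url: str = BASE_URL,
                         model: str = MODEL_NAME,
                         voice: str = VOICE) -> urllib.request.Request:
    """Build a POST request shaped like the OpenAI speech endpoint (no network I/O)."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str) -> int:
    """Send the request, write the returned audio bytes to out_path, return the byte count."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

With a server running, synthesize("Hello from Higgs Audio.", "hello.wav") would write the response body to hello.wav. Building the request separately from sending it keeps the payload easy to inspect and adjust once the actual route and parameter names are confirmed.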

Maintenance & Community

Licensing & Compatibility

  • The primary license is not explicitly stated in the README.
  • The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily xcodec, with its own LICENSE file.

Limitations & Caveats

  • Optimal performance requires an NVIDIA GPU with at least 24GB memory.
  • The primary license is not clearly specified, which may impact commercial use.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 10

Star History

  • 198 stars in the last 30 days