higgs-audio  by boson-ai

Expressive text-to-audio generation model

Created 6 months ago
7,879 stars

Top 6.6% on SourcePulse

Project Summary

Higgs Audio v2 is a text-to-audio foundation model for expressive, multi-speaker audio generation. It targets researchers and developers who need advanced TTS capabilities. The project reports strong results on benchmarks such as Seed-TTS Eval, ESD, and EmergentTTS-Eval, and adds features such as melodic humming and simultaneous speech and music generation.

How It Works

Higgs Audio v2 leverages a "generation variant" architecture trained on over 10 million hours of audio data. Key innovations include an automated annotation pipeline for its AudioVerse dataset, a unified audio tokenizer capturing semantic and acoustic features, and the DualFFN architecture to enhance acoustic modeling with minimal overhead. This approach enables sophisticated prosody adaptation, multi-speaker dialogue, and zero-shot voice cloning.

Quick Start & Requirements

  • Installation: Recommended via NVIDIA Deep Learning Containers (e.g., nvcr.io/nvidia/pytorch:25.02-py3). Direct installation involves git clone, followed by pip install -r requirements.txt and then pip install -e . (an editable install from the repository root). Setup options for venv, conda, uv, and vLLM are also provided.
  • Prerequisites: NVIDIA GPU with at least 24GB memory recommended for optimal performance.
  • Links: Demo Video, Multilingual Demo, Tokenizer Blog, Architecture Blog.
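The 24GB recommendation above can be checked before installing. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and reading "24GB" as 24 × 10^9 bytes is an assumption (cards marketed as 24GB typically report slightly less than 24 GiB of usable memory). PyTorch is imported lazily, so the script still runs on a machine where it is not installed.

```python
from typing import Optional

# Assumption: "24GB" in the prerequisites is read as 24 * 10**9 bytes.
RECOMMENDED_BYTES = 24 * 10**9


def meets_recommendation(total_bytes: int, recommended: int = RECOMMENDED_BYTES) -> bool:
    """Return True if the reported GPU memory reaches the recommended amount."""
    return total_bytes >= recommended


def detect_gpu_memory_bytes() -> Optional[int]:
    """Return total memory of CUDA device 0, or None if it cannot be determined."""
    try:
        import torch  # imported lazily; may not be installed yet
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


if __name__ == "__main__":
    total = detect_gpu_memory_bytes()
    if total is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    elif meets_recommendation(total):
        print(f"GPU memory OK: {total / 10**9:.1f} GB")
    else:
        print(f"GPU memory below recommendation: {total / 10**9:.1f} GB")
```

Because the 24GB figure is a recommendation for optimal performance rather than a stated hard requirement, a failed check is better treated as a warning than as a reason to abort.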

Highlighted Details

  • Achieves win rates of 75.7% and 55.7% over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories of EmergentTTS-Eval, respectively.
  • Supports zero-shot voice cloning, multi-speaker dialogues, melodic humming, and speech with background music generation.
  • Evaluated on Seed-TTS Eval, ESD, EmergentTTS-Eval, and a custom Multi-speaker Eval benchmark.
  • Offers an OpenAI-compatible API server backed by the vLLM engine for higher throughput.
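Since the server is described as OpenAI-compatible, a client request would presumably follow the shape of OpenAI's speech endpoint (POST /v1/audio/speech with model, input, voice, and response_format fields). The sketch below only builds such a request with the standard library. Whether this server exposes that exact route is an assumption, as are the base URL, port, model name, and voice name, which are placeholders to be replaced with whatever the running server reports.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    base_url: str = "http://localhost:8000",  # assumed host and port
    model: str = "higgs-audio-v2",            # hypothetical model name
    voice: str = "default",                   # hypothetical voice name
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello from Higgs Audio.")
    print(req.get_method(), req.full_url)
    # Sending requires a running server, for example:
    # with urllib.request.urlopen(req) as resp, open("out.wav", "wb") as f:
    #     f.write(resp.read())
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a live server.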

Licensing & Compatibility

  • The primary license is not explicitly stated in the README.
  • The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily xcodec, with its own LICENSE file.

Limitations & Caveats

  • Optimal performance requires an NVIDIA GPU with at least 24GB memory.
  • The primary license is not clearly specified, which may impact commercial use.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star History: 119 stars in the last 30 days