TADA by Hume AI

Generative speech modeling framework

Created 1 month ago
959 stars

Top 38.2% on SourcePulse

Project Summary

TADA is a unified speech-language model designed to address the computational inefficiencies and transcript hallucination common in traditional Text-to-Speech (TTS) systems. It targets researchers and developers seeking high-fidelity speech synthesis with a more natural flow and reduced computational overhead. The core benefit lies in its novel 1:1 text-acoustic alignment, enabling a more cohesive and efficient speech generation process.

How It Works

TADA utilizes a unique tokenization schema that aligns each text token with a single speech vector, creating a synchronized stream. Its dynamic autoregression allows the model to generate the entire speech segment for a text token in one step, dynamically controlling duration and prosody. This dual-stream generation approach simultaneously produces text tokens and the speech for preceding tokens, maintaining context while significantly lowering computational costs compared to fixed-frame-rate models.
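The alignment and dual-stream loop described above can be illustrated with a minimal toy sketch. This is not Hume AI's implementation; every name in it (synthesize, wave[...]) is hypothetical, and real speech vectors stand in as placeholder strings:

```python
def synthesize(text_tokens):
    """Toy dual-stream loop: at each step, emit the next text token and
    the speech for the *preceding* token, keeping a 1:1 alignment
    between text tokens and speech entries."""
    stream = []
    prev = None
    for tok in text_tokens:
        if prev is not None:
            # Speech for the previous token is produced alongside the
            # next text token, so the two streams advance together.
            stream.append(("speech", f"wave[{prev}]"))
        stream.append(("text", tok))
        prev = tok
    if prev is not None:
        # Flush speech for the final text token.
        stream.append(("speech", f"wave[{prev}]"))
    return stream
```

Note the 1:1 property: each text token yields exactly one speech entry, and speech for a token is emitted while the next text token is being generated, which is what keeps the two streams synchronized without a fixed frame rate.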

Quick Start & Requirements

Installation is straightforward via pip: pip install hume-tada. Alternatively, clone the repository and install from source with pip install -e . (an editable install). The project offers models such as TADA-1B and TADA-3B-ML. The inference examples require a CUDA-enabled GPU.

Highlighted Details

  • 1:1 Token Alignment: Achieves precise synchronization between text tokens and corresponding speech vectors.
  • Dynamic Duration Synthesis: Generates speech for each text token in a single autoregressive step, adapting duration and prosody.
  • Dual-Stream Generation: Processes text and speech concurrently, maintaining context and improving efficiency.
  • Multilingual Support: Includes language-specific aligners for Arabic, Chinese, German, Spanish, French, Italian, Japanese, Polish, and Portuguese.
  • Speech Continuation: Allows for generating speech beyond an initial prompt.

Maintenance & Community

This project is developed by Hume AI, an "empathic AI research company." For inquiries regarding product or research collaborations, contact hello@hume.ai. The README does not provide links to community channels like Discord or Slack, nor a public roadmap.

Licensing & Compatibility

The README does not state a software license. Prospective adopters should confirm licensing terms before use, particularly for commercial deployment or integration into closed-source applications.

Limitations & Caveats

The built-in Automatic Speech Recognition (ASR) used for prompt encoding is English-only. For non-English prompts, users must supply the corresponding transcript to the encoder; otherwise alignment quality may degrade.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 18
Issues (30d): 6
Star History: 344 stars in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.