MahaTTS by dubverse-ai

Open-source TTS model for multilingual voice cloning

Created 1 year ago
274 stars

Top 94.3% on SourcePulse

Project Summary

MahaTTS is an open-source, large-scale text-to-speech (TTS) model developed by Dubverse.ai, offering multilingual voice cloning and cross-lingual prosody transfer. It is designed for researchers and developers seeking advanced speech synthesis capabilities, including zero-shot voice cloning and style transfer across languages, with pre-trained checkpoints available for commercial use.

How It Works

MahaTTS draws inspiration from Tortoise TTS but uses the Seamless M4T wav2vec2 model for semantic token extraction. Because that wav2vec2 model was trained on multilingual data, the approach scales more readily across languages. The architecture comprises three components: a Text-to-Semantic model (84M parameters, causal LM), a Semantic-to-MelSpec diffusion model (430M parameters), and a HiFi-GAN vocoder (13M parameters) that generates the audio waveform.
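
To make the relative size of the three stages concrete, the following Python sketch tallies the parameter counts quoted above and estimates a weight-only memory footprint at two precisions. The `Stage` class, the stage labels, and the helper names are illustrative and not taken from the MahaTTS codebase; the estimate assumes weights only and ignores activations and any runtime overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of the pipeline as described above (names are illustrative)."""
    name: str
    params_millions: int


# Parameter counts as quoted in the summary; order follows the data flow
# text -> semantic tokens -> mel spectrogram -> waveform.
PIPELINE = [
    Stage("text-to-semantic (causal LM)", 84),
    Stage("semantic-to-melspec (diffusion)", 430),
    Stage("HiFi-GAN vocoder", 13),
]


def total_params_millions(stages):
    """Sum the parameter counts of all stages, in millions."""
    return sum(s.params_millions for s in stages)


def weight_memory_gb(stages, bytes_per_param):
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return total_params_millions(stages) * 1e6 * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"total: {total_params_millions(PIPELINE)}M parameters")
    print(f"fp32 weights: ~{weight_memory_gb(PIPELINE, 4):.2f} GB")
    print(f"fp16 weights: ~{weight_memory_gb(PIPELINE, 2):.2f} GB")
```

By these figures the diffusion model holds roughly 82% of the 527M total parameters, so it is likely the dominant cost among the three stages.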

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/dubverse-ai/MahaTTS.git
  • Requires PyTorch; a CUDA-enabled GPU is recommended for optimal performance.
  • Example usage and pretrained models are available on Hugging Face.
  • A Colab notebook is provided for quick experimentation.
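
Before installing, it can help to confirm that PyTorch is present and can see a GPU. The sketch below uses only the standard library's `importlib` and PyTorch's `torch.cuda.is_available()`; `pick_device` is a helper name chosen here for illustration and is not part of MahaTTS.

```python
import importlib.util


def pick_device():
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so nothing GPU-related can be checked.
        return "cpu"
    import torch  # imported lazily so the script still runs without PyTorch
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("device:", pick_device())
```

A result of "cpu" does not distinguish between a missing PyTorch install and a missing GPU, so check the install first if the output is unexpected.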

Highlighted Details

  • Supports voice cloning for both seen and unseen speaker identities.
  • Enables multilingual and cross-lingual voice cloning with prosody transfer.
  • Released "Smolie English" (9k hours English data) and "Smolie Indic" (400 hours, 9 Indian languages).
  • Future plans include a 1B parameter model trained on 20K hours across 15 languages.

Maintenance & Community

  • Project is actively under development, with ongoing work to improve robustness and reduce latency.
  • Updates may be slow while the team trains larger models.
  • Contributions for inference optimization are welcomed.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Pretrained model checkpoints are available for commercial use.

Limitations & Caveats

Latency is noted as an ongoing issue. The project is actively training larger models, suggesting potential for breaking changes or API shifts in future releases.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pietro Schirano (Founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.1%
4k
TTS model for human-like, expressive speech
Created 1 year ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

0.2%
34k
Audio foundation model for versatile, instant voice cloning
Created 1 year ago
Updated 5 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.3%
51k
Few-shot voice cloning and TTS web UI
Created 1 year ago
Updated 1 week ago