MahaTTS by dubverse-ai

Open-source TTS model for multilingual voice cloning

created 1 year ago
273 stars

Top 95.3% on sourcepulse

Project Summary

MahaTTS is an open-source, large-scale text-to-speech (TTS) model developed by Dubverse.ai, offering multilingual voice cloning and cross-lingual prosody transfer. It is designed for researchers and developers seeking advanced speech synthesis capabilities, including zero-shot voice cloning and style transfer across languages, with pre-trained checkpoints available for commercial use.

How It Works

MahaTTS draws inspiration from Tortoise TTS but uses the Seamless M4T wav2vec2 model for semantic token extraction. Because this wav2vec2 variant was trained on multilingual data, the approach scales across languages. The architecture comprises three stages: a Text-to-Semantic model (84M parameters, causal LM), a Semantic-to-MelSpec diffusion model (430M parameters), and a HiFi-GAN vocoder (13M parameters) that generates the audio waveform.
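To make the size of each stage concrete, the sketch below tallies the parameter counts quoted above and shows how they are distributed across the pipeline. It is only an illustration: the `Stage` class, the `PIPELINE` tuple, and the helper functions are hypothetical names introduced here and are not part of the MahaTTS codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of the three-stage pipeline (illustrative structure only)."""
    name: str
    role: str
    params_millions: int


# Parameter counts are the ones quoted in this summary;
# the names and layout are assumptions made for this example.
PIPELINE = (
    Stage("Text-to-Semantic", "causal LM: text -> semantic tokens", 84),
    Stage("Semantic-to-MelSpec", "diffusion: semantic tokens -> mel spectrogram", 430),
    Stage("HiFi-GAN vocoder", "mel spectrogram -> audio waveform", 13),
)


def total_params_millions(stages):
    """Sum the parameter counts (in millions) over all stages."""
    return sum(stage.params_millions for stage in stages)


def share_by_stage(stages):
    """Return each stage's fraction of the total parameter count."""
    total = total_params_millions(stages)
    return {stage.name: stage.params_millions / total for stage in stages}


if __name__ == "__main__":
    print(f"Total: {total_params_millions(PIPELINE)}M parameters")
    for name, share in share_by_stage(PIPELINE).items():
        print(f"{name}: {share:.1%}")
```

By this tally the three stages sum to 527M parameters, with the diffusion model accounting for most of them.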

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/dubverse-ai/MahaTTS.git
  • Requires PyTorch; a CUDA-enabled GPU is recommended for optimal performance.
  • Example usage and pretrained models are available on Hugging Face.
  • A Colab notebook is provided for quick experimentation.
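Before running any inference, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch that checks for PyTorch, CUDA availability, and the installed package. The import name `maha_tts` is an assumption about how the package is exposed after installation, and `check_environment` is a hypothetical helper, not part of the project.

```python
import importlib.util


def check_environment(package_name="maha_tts"):
    """Report whether PyTorch, CUDA, and the TTS package appear usable.

    `package_name` is an assumed import name; adjust it if the
    installed package exposes a different top-level module.
    """
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        f"{package_name}_installed": importlib.util.find_spec(package_name) is not None,
    }
    if report["torch_installed"]:
        try:
            import torch

            # True only if PyTorch was built with CUDA and a GPU is visible.
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated the same as a missing one.
            report["torch_installed"] = False
    return report


if __name__ == "__main__":
    for check, ok in check_environment().items():
        print(f"{check}: {'yes' if ok else 'no'}")
```

If `cuda_available` is reported as `no`, the model may still load on CPU, but inference is likely to be slow given the latency caveats noted in this summary.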

Highlighted Details

  • Supports voice cloning for both seen and unseen speaker identities.
  • Enables multilingual and cross-lingual voice cloning with prosody transfer.
  • Released "Smolie English" (trained on 9k hours of English data) and "Smolie Indic" (400 hours across 9 Indian languages).
  • Future plans include a 1B parameter model trained on 20K hours across 15 languages.

Maintenance & Community

  • Project is actively under development, with ongoing work to improve robustness and reduce latency.
  • Updates may take time while the team trains larger models.
  • Contributions for inference optimization are welcomed.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Pretrained model checkpoints are available for commercial use.

Limitations & Caveats

Latency is noted as an ongoing issue. The project is actively training larger models, suggesting potential for breaking changes or API shifts in future releases.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
