tts by inworld-ai

TTS training framework for SpeechLM models

Created 4 months ago
491 stars

Top 62.9% on SourcePulse

Project Summary

This repository provides the training and modeling code for Inworld's SpeechLM-based Text-To-Speech (TTS) models, enabling users to pre-train, fine-tune, or align their own TTS models. It supports single or multi-GPU setups and is designed for researchers and developers working with advanced speech synthesis.

How It Works

The system pairs a SpeechLM with 1D audio codecs for TTS generation. It supports distributed training via DDP, DeepSpeed, and FSDP, offering flexibility across hardware configurations. A data pipeline is included for converting raw audio into discrete audio codes, which then condition the model during speech generation.
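To make the "audio into discrete codes" step concrete, here is an illustrative sketch only: the repository's actual codec is a learned neural 1D audio codec, not the toy uniform quantizer below, and the function name is hypothetical.

```python
# Toy stand-in for a neural audio codec: map waveform samples in [-1.0, 1.0]
# to discrete code indices via uniform binning. A real 1D codec learns the
# codebook, but the output shape of the idea is the same: a token sequence.

def audio_to_codes(samples, num_codes=256):
    """Quantize waveform samples into integer codes in [0, num_codes - 1]."""
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))                     # clamp to valid range
        codes.append(int((s + 1.0) / 2.0 * (num_codes - 1)))
    return codes

waveform = [-1.0, 0.0, 1.0]
codes = audio_to_codes(waveform)
print(codes)
```

Sequences like `codes` are what a SpeechLM consumes alongside text tokens, so speech generation reduces to next-token prediction over codec tokens.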

Quick Start & Requirements

  • Installation: make install (with optional CUDA_VERSION argument).
  • Prerequisites: Python 3.10, CUDA 12.4 or 12.8, PyTorch 2.6/2.7. uv is recommended for package management.
  • Setup: The make install command automates virtual environment creation, PyTorch installation with flash attention, and dependency setup.
  • Documentation: Inworld TTS Playground Examples, Technical Report
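The setup steps above can be sketched as a shell session. The repository URL and the exact `make` behavior are assumptions based on this summary; consult the README for the authoritative commands.

```shell
# Assumed quick-start flow (repository URL inferred from the project name).
git clone https://github.com/inworld-ai/tts.git
cd tts

# Creates a virtual environment, installs PyTorch with flash attention,
# and sets up dependencies:
make install

# Or pin a supported CUDA version explicitly (12.4 or 12.8):
make install CUDA_VERSION=12.8
```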

Highlighted Details

  • Supports SpeechLM and 1D audio-codecs.
  • Distributed training with DDP, DeepSpeed, and FSDP.
  • Includes data preparation and vectorization scripts.
  • Offers example data and configuration for testing.
  • Provides an inference script for generating speech from text and audio prompts.

Maintenance & Community

  • Contributions are welcomed via pull requests.
  • Bug reports should be filed as GitHub Issues.
  • General inquiries can be directed via email.
  • Acknowledgments include Meta AI for LLaMA LLMs and the PyTorch/Hugging Face communities.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The code is only tested on Ubuntu 22.04. Training requires significant computational resources and a prepared dataset in a specific JSONL format. Inference requires multiple model checkpoints (trained model, audio encoder, audio decoder).
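Since training expects a JSONL dataset, the snippet below sketches what such a manifest looks like. The field names (`audio_path`, `text`, `speaker`) are hypothetical; the real schema is defined by the repository's data-preparation scripts.

```python
import json

# Hypothetical training manifest: one JSON object per line (JSONL).
samples = [
    {"audio_path": "clips/utt_0001.wav", "text": "Hello there.", "speaker": "spk_01"},
    {"audio_path": "clips/utt_0002.wav", "text": "How are you?", "speaker": "spk_02"},
]

with open("train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Because each line is a standalone JSON object, loaders can stream the file
# without parsing it all at once.
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))
```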

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 37 stars in the last 30 days
Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 5 more.

  • ultravox by fixie-ai — Multimodal LLM for real-time voice interactions. 0.2% · 4k stars · created 1 year ago · updated 2 weeks ago.