tts by inworld-ai

TTS training framework for SpeechLM models

Created 4 months ago
491 stars

Top 62.9% on SourcePulse

Project Summary

This repository provides the training and modeling code for Inworld's SpeechLM-based Text-To-Speech (TTS) models, enabling users to pre-train, fine-tune, or align their own TTS models. It supports single or multi-GPU setups and is designed for researchers and developers working with advanced speech synthesis.

How It Works

The system pairs a SpeechLM with 1D audio codecs for TTS generation. It supports distributed training via DDP, DeepSpeed, and FSDP, offering flexibility across hardware configurations. A data pipeline is included for converting raw audio into discrete audio codes, which then condition the model during speech generation.
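To make the "audio into discrete codes" step concrete, here is an illustrative sketch only: the repository's actual codec is a learned neural 1D audio codec, not the toy uniform quantizer below, and the function name is hypothetical.

```python
# Toy stand-in for a neural audio codec: map waveform samples in [-1.0, 1.0]
# to discrete code indices via uniform binning. A real 1D codec learns the
# codebook, but the output shape of the idea is the same: a token sequence.

def audio_to_codes(samples, num_codes=256):
    """Quantize waveform samples into integer codes in [0, num_codes - 1]."""
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))                     # clamp to valid range
        codes.append(int((s + 1.0) / 2.0 * (num_codes - 1)))
    return codes

waveform = [-1.0, 0.0, 1.0]
codes = audio_to_codes(waveform)
print(codes)
```

Sequences like `codes` are what a SpeechLM consumes alongside text tokens, so speech generation reduces to next-token prediction over codec tokens.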

Quick Start & Requirements

  • Installation: make install (with optional CUDA_VERSION argument).
  • Prerequisites: Python 3.10, CUDA 12.4 or 12.8, PyTorch 2.6/2.7. uv is recommended for package management.
  • Setup: The make install command automates virtual environment creation, PyTorch installation with flash attention, and dependency setup.
  • Documentation: Inworld TTS Playground Examples, Technical Report
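The setup steps above can be sketched as a shell session. The repository URL and the exact `make` behavior are assumptions based on this summary; consult the README for the authoritative commands.

```shell
# Assumed quick-start flow (repository URL inferred from the project name).
git clone https://github.com/inworld-ai/tts.git
cd tts

# Creates a virtual environment, installs PyTorch with flash attention,
# and sets up dependencies:
make install

# Or pin a supported CUDA version explicitly (12.4 or 12.8):
make install CUDA_VERSION=12.8
```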

Highlighted Details

  • Supports SpeechLM and 1D audio-codecs.
  • Distributed training with DDP, DeepSpeed, and FSDP.
  • Includes data preparation and vectorization scripts.
  • Offers example data and configuration for testing.
  • Provides an inference script for generating speech from text and audio prompts.

Maintenance & Community

  • Contributions are welcomed via pull requests.
  • Bug reports should be filed as GitHub Issues.
  • General inquiries can be directed via email.
  • Acknowledgments include Meta AI for LLaMA LLMs and the PyTorch/Hugging Face communities.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The code is only tested on Ubuntu 22.04. Training requires significant computational resources and a prepared dataset in a specific JSONL format. Inference requires multiple model checkpoints (trained model, audio encoder, audio decoder).
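Since training expects a JSONL dataset, the snippet below sketches what such a manifest looks like. The field names (`audio_path`, `text`, `speaker`) are hypothetical; the real schema is defined by the repository's data-preparation scripts.

```python
import json

# Hypothetical training manifest: one JSON object per line (JSONL).
samples = [
    {"audio_path": "clips/utt_0001.wav", "text": "Hello there.", "speaker": "spk_01"},
    {"audio_path": "clips/utt_0002.wav", "text": "How are you?", "speaker": "spk_02"},
]

with open("train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Because each line is a standalone JSON object, loaders can stream the file
# without parsing it all at once.
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))
```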

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 37 stars in the last 30 days
Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 5 more.

  • ultravox by fixie-ai — Multimodal LLM for real-time voice interactions. 0.2% · 4k stars · created 1 year ago · updated 2 weeks ago.