LLaSA_training by zhenye234

Training code for a LLaMA-based speech synthesis research paper

Created 1 year ago

644 stars

Top 51.8% on SourcePulse

Project Summary

LLaSA is a framework for speech synthesis that scales both training and inference compute for LLaMA-based models. It targets researchers and developers working on large-scale, multilingual text-to-speech systems, offering a unified approach to handle both text and speech tokens.

How It Works

LLaSA employs a unified tokenizer that combines text tokens from Llama models with specialized speech tokens derived from X-codec2. This approach allows for end-to-end training of speech synthesis models, enabling efficient scaling of compute resources for both training and inference.
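
One way to picture the unified vocabulary is as the text vocabulary with a block of speech-token IDs appended after it. The sketch below is illustrative only: the vocabulary sizes, the offset scheme, and every function name are assumptions made for this example, not values or APIs taken from the LLaSA code or from X-codec2.

```python
from typing import List, Tuple

# Hypothetical sizes, chosen only to keep the example readable.
TEXT_VOCAB_SIZE = 1000   # stand-in for the size of the Llama text vocabulary
CODEBOOK_SIZE = 16       # stand-in for the number of distinct speech codes


def speech_code_to_token_id(code: int) -> int:
    """Map a codec code to an ID placed after the text vocabulary."""
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"speech code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def token_id_to_speech_code(token_id: int) -> int:
    """Invert speech_code_to_token_id for IDs in the speech block."""
    code = token_id - TEXT_VOCAB_SIZE
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"not a speech token ID: {token_id}")
    return code


def build_sequence(text_ids: List[int], speech_codes: List[int]) -> List[int]:
    """Concatenate text IDs and shifted speech codes into one token sequence."""
    return list(text_ids) + [speech_code_to_token_id(c) for c in speech_codes]


def split_sequence(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Separate a mixed sequence back into text IDs and raw speech codes."""
    text_ids = [t for t in token_ids if t < TEXT_VOCAB_SIZE]
    speech_codes = [token_id_to_speech_code(t) for t in token_ids if t >= TEXT_VOCAB_SIZE]
    return text_ids, speech_codes


sequence = build_sequence([1, 2], [0, 3])
print(sequence)
```

Because text and speech share one ID space in this sketch, a single language model can be trained on the concatenated sequence with an ordinary next-token objective, which is the property the unified approach relies on.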

Quick Start & Requirements

Run: torchrun --nproc_per_node=8 train_tts.py config.json (launches 8 processes on a single node), or sbatch run_slurm.sh to submit the job through Slurm.
Prerequisites: Python, PyTorch, the xcodec2 speech codec (hosted on Hugging Face), and the Llama tokenizer. Training requires significant computational resources.
Data: Open-source datasets (LibriHeavy, Emilia, WenetSpeech4TTS) totaling 160,000 hours are available. Models are trained on 250,000 hours, including 90,000 hours of internal data.
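
The single-node launch command listed above can also be assembled programmatically, for example to vary the number of processes per node. This is a minimal sketch: the helper name `build_torchrun_command` is hypothetical, and the contents of `config.json` are not described in this summary, so the file is treated as an opaque path.

```python
import shlex
from typing import List


def build_torchrun_command(
    num_procs: int = 8,
    script: str = "train_tts.py",
    config: str = "config.json",
) -> List[str]:
    """Build the torchrun argument list for a single-node run."""
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    return ["torchrun", f"--nproc_per_node={num_procs}", script, config]


command = build_torchrun_command(8)
# Print the command in shell form; this does not start training.
print(shlex.join(command))
```

To actually start training, the list could be passed to `subprocess.run(command, check=True)` from a directory that contains the training script and the config file.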

Highlighted Details

Supports multilingual speech synthesis (Chinese, English, Japanese, Korean).
Offers LLaSA 1B models, including multilingual and finetuned versions.
Utilizes X-codec2 for speech tokenization.
Paper released (2025-02-07).

Maintenance & Community

Recent updates include finetune instructions and multilingual model releases.
The README provides no community links (Discord/Slack) and no roadmap.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial or closed-source use is undetermined.

Limitations & Caveats

The released models were trained partly on internal data (90,000 of the 250,000 training hours) that is not publicly available, which limits exact reproducibility for users who only have access to the open-source datasets. The absence of a specified license also leaves commercial use unresolved.
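
To put the reproducibility gap in numbers, the hour counts listed under Quick Start & Requirements can be turned into shares of the full training set. This is plain arithmetic on the figures quoted in this summary; the variable names are only for illustration.

```python
# Hour counts as quoted in this summary.
OPEN_SOURCE_HOURS = 160_000   # LibriHeavy, Emilia, WenetSpeech4TTS combined
INTERNAL_HOURS = 90_000       # internal data, not publicly released
TOTAL_HOURS = 250_000         # total training data reported

# The two parts should add up to the reported total.
assert OPEN_SOURCE_HOURS + INTERNAL_HOURS == TOTAL_HOURS

open_share = OPEN_SOURCE_HOURS / TOTAL_HOURS
internal_share = INTERNAL_HOURS / TOTAL_HOURS

print(f"open-source share: {open_share:.0%}")    # 64%
print(f"internal share:    {internal_share:.0%}")  # 36%
```

In other words, a reproduction limited to the open-source datasets would cover roughly two thirds of the training hours used for the released models.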
