LLaSA_training  by zhenye234

Speech synthesis research paper using LLaMA

created 6 months ago
595 stars

Top 55.5% on sourcepulse

Project Summary

LLaSA is a framework for speech synthesis that scales both training and inference compute for LLaMA-based models. It targets researchers and developers working on large-scale, multilingual text-to-speech systems, offering a unified approach to handle both text and speech tokens.

How It Works

LLaSA employs a unified tokenizer that combines text tokens from Llama models with specialized speech tokens derived from X-codec2. This approach allows for end-to-end training of speech synthesis models, enabling efficient scaling of compute resources for both training and inference.
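One common way to combine text and speech tokens in a single vocabulary is to append the codec's code indices after the text IDs. The sketch below illustrates that idea only. The class, the function, and the two size constants are hypothetical, and the actual LLaSA tokenizer may be organized differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class UnifiedVocab:
    """Hypothetical unified ID space: text IDs first, speech codes after them."""

    text_vocab_size: int
    speech_codebook_size: int

    @property
    def size(self) -> int:
        # Total number of IDs the language model would need to embed.
        return self.text_vocab_size + self.speech_codebook_size

    def speech_to_id(self, code: int) -> int:
        # Shift a codec index past the end of the text vocabulary.
        if not 0 <= code < self.speech_codebook_size:
            raise ValueError(f"speech code out of range: {code}")
        return self.text_vocab_size + code

    def id_to_speech(self, token_id: int) -> int:
        # Undo the shift to recover the codec index for decoding to audio.
        if not self.is_speech(token_id):
            raise ValueError(f"not a speech token id: {token_id}")
        return token_id - self.text_vocab_size

    def is_speech(self, token_id: int) -> bool:
        return self.text_vocab_size <= token_id < self.size


def build_sequence(
    vocab: UnifiedVocab, text_ids: Sequence[int], speech_codes: Sequence[int]
) -> List[int]:
    """Concatenate text IDs and shifted speech codes into one training sequence."""
    return list(text_ids) + [vocab.speech_to_id(c) for c in speech_codes]


# Illustrative sizes only (assumptions, not taken from the source).
TEXT_VOCAB_SIZE = 128_256
SPEECH_CODEBOOK_SIZE = 65_536

vocab = UnifiedVocab(TEXT_VOCAB_SIZE, SPEECH_CODEBOOK_SIZE)
sequence = build_sequence(vocab, text_ids=[10, 20], speech_codes=[0, 5])
```

With a layout like this, one next-token objective covers both modalities, and generated IDs can be routed to the codec decoder by checking `is_speech`.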

Quick Start & Requirements

  • Install/Run: torchrun --nproc_per_node=8 train_tts.py config.json (8 processes per node), or sbatch run_slurm.sh on a Slurm cluster.
  • Prerequisites: Python, PyTorch, the xcodec2 codec from Hugging Face, and a Llama tokenizer. Training requires significant computational resources.
  • Data: The open-source datasets (LibriHeavy, Emilia, WenetSpeech4TTS) total 160,000 hours. The models are trained on 250,000 hours, which includes 90,000 hours of internal data.
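For readers new to torchrun: it starts one process per requested slot and passes each its position through environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows how a training script could read them. The `DistEnv` class and `read_dist_env` helper are hypothetical and are not taken from train_tts.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    """Position of one worker process in a torchrun launch."""

    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node (often the GPU index)
    world_size: int  # total number of processes in the job

    @property
    def is_main(self) -> bool:
        # Rank 0 usually handles logging and checkpoint writing.
        return self.rank == 0


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read torchrun's variables, falling back to a single-process setup."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


# Example: the values the fourth worker would see under --nproc_per_node=8
# on a single node.
worker = read_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
```

Calling `read_dist_env()` with no argument reads the real environment, so the same code also runs unchanged when the script is started without torchrun.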

Highlighted Details

  • Supports multilingual speech synthesis (Chinese, English, Japanese, Korean).
  • Offers LLaSA 1B models, including multilingual and finetuned versions.
  • Utilizes X-codec2 for speech tokenization.
  • Paper released (2025-02-07).

Maintenance & Community

  • Recent updates include finetune instructions and multilingual model releases.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not specify a license. Compatibility for commercial or closed-source use is undetermined.

Limitations & Caveats

The project relies on internal datasets not available for public release, which may limit reproducibility for users without access to similar proprietary data. The absence of a specified license raises concerns about commercial use.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 48 stars in the last 90 days
