Fine-tune audio models with LoRA
Moshi-Finetune offers a streamlined workflow for fine-tuning Moshi speech models with LoRA, enabling users to adapt pre-trained models to custom conversational datasets. It targets researchers and developers building personalized voice assistants or specialized audio transcription tools. The primary benefit is efficient, lightweight model adaptation without the extensive computational resources that full fine-tuning requires.
How It Works
The project leverages LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, significantly reducing the number of trainable parameters. It processes stereo audio, using the left channel for generated audio and the right for user input, with associated JSON files containing timestamped transcripts. Training is orchestrated via YAML configuration files, allowing customization of hyperparameters, dataset paths, and LoRA settings.
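To make the YAML-driven setup concrete, here is a minimal sketch of such a config. The key names (`data.train_data`, `lora.rank`, and so on) are illustrative assumptions modeled on common LoRA fine-tuning setups, not the repository's exact schema; the example configs shipped with the repo are authoritative.

```yaml
# Hypothetical training config: key names are assumptions, not the
# repository's actual schema.
data:
  train_data: /path/to/index.jsonl  # .jsonl index of stereo audio + transcripts
lora:
  enable: true
  rank: 16        # low-rank dimension; smaller rank means fewer trainable params
lr: 2.0e-5
batch_size: 4
max_steps: 1000
save_dir: /path/to/checkpoints
```

A file like this is then passed to the training entry point, e.g. `torchrun --nproc-per-node 2 -m train finetune.yaml` for a two-GPU run (the config file name here is hypothetical).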
Quick Start & Requirements
- Clone: `git clone git@github.com:kyutai-labs/moshi-finetune.git`.
- Install: `uv run pip install -e .` or `pip install -e .` (Python 3.10+ recommended).
- Data: stereo audio paired with timestamped JSON transcripts, referenced from a `.jsonl` index. A sample 14 GB dataset can be downloaded via `snapshot_download("kyutai/DailyTalkContiguous")`; see the sketch after this list.
- Train: `torchrun --nproc-per-node <N> -m train <config_file.yaml>`.
- Inference: install the `moshi` package and run `python -m moshi.server`.
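As a sketch of the data step, the snippet below downloads the sample dataset with `huggingface_hub.snapshot_download` and peeks at a `.jsonl` index entry. The `repo_type="dataset"` argument, the glob pattern, and the transcript shape in the final comment are assumptions; the source shows only the repo id.

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the ~14 GB sample dataset named above.
# repo_type="dataset" is an assumption; the source shows only the repo id.
data_dir = Path(snapshot_download("kyutai/DailyTalkContiguous", repo_type="dataset"))

# Peek at the first entry of a .jsonl index file, if one is present.
# The file layout and field names are not specified in this summary,
# so treat whatever prints here as the authoritative schema.
index_file = next(data_dir.glob("**/*.jsonl"), None)
if index_file is not None:
    with index_file.open() as f:
        print(json.loads(f.readline()))

# Each audio file is expected to have a companion JSON transcript with
# word-level timestamps, conceptually something like (fields hypothetical):
#   {"alignments": [["Hello", [0.00, 0.42]], ["there", [0.42, 0.77]]]}
```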
Highlighted Details
- `annotate.py` script for generating JSON transcripts, with SLURM support for distributed annotation.
Maintenance & Community
- Adapted from `mistral-finetune` (Apache License 2.0).
Licensing & Compatibility
- The codebase builds on `mistral-finetune`, which is licensed under Apache License 2.0.
Limitations & Caveats
- The README does not specify the license for the `moshi-finetune` repository itself, which could impact commercial adoption.
- No community support channels or public roadmap are mentioned.