RSTnet by yangdongchao

Real-time speech-text foundation model toolkit

Created 1 year ago

254 stars

Top 99.1% on SourcePulse

Project Summary

RSTnet tackles the challenges of training real-time speech-text foundation models, providing an open-source toolkit for researchers. It offers a framework for data processing, audio codec integration, and model pre-training and fine-tuning, aiming to accelerate the development of speech-text systems.

How It Works

The toolkit combines three components: data preparation, streaming audio codec models (including a reproduction of MimiCodec), and speech-text foundation models. It builds on the Moshi and UniAudio codebases, supports several LLM backbones (LLAMA, Gemma, Mistral, Phi, StableLM, Qwen), and allows either LoRA fine-tuning or full-parameter training.

Quick Start & Requirements

Installation requires Python 3.12 and PyTorch with CUDA 12.1 support, along with the packages tqdm, librosa==0.9.1, matplotlib, omegaconf, einops, vector_quantize_pytorch, tensorboard, deepspeed, and peft. GPU acceleration is implied, and pre-training may need significant resources. A technical report is available at https://github.com/yangdongchao/RSTnet/blob/main/RSTnet.pdf.
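A quick way to confirm that an environment matches the requirements above is a small check script. The sketch below is not part of RSTnet: the function name check_environment is hypothetical, the mapping from package names to import names (for example, PyTorch as torch) is an assumption, and treating Python 3.12 as a minimum rather than an exact version is also an assumption. It does not verify pinned versions such as librosa==0.9.1 or the CUDA build of PyTorch.

```python
import sys
from importlib.util import find_spec

# Import names assumed for the packages listed in the requirements.
REQUIRED_MODULES = [
    "torch", "tqdm", "librosa", "matplotlib", "omegaconf", "einops",
    "vector_quantize_pytorch", "tensorboard", "deepspeed", "peft",
]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 12)):
    """Return (python_ok, missing).

    python_ok is True when the interpreter is at least min_python
    (assumption: 3.12 is read as a minimum version).
    missing lists the module names that cannot be located.
    """
    python_ok = sys.version_info[:2] >= tuple(min_python)
    missing = [name for name in modules if find_spec(name) is None]
    return python_ok, missing


if __name__ == "__main__":
    python_ok, missing = check_environment()
    print("Python version OK:", python_ok)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Running the script prints whether the interpreter version is acceptable and which of the listed modules still need to be installed.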

Highlighted Details

Supports pre-training of speech-text foundation models (MLLM_v2).
Accommodates diverse LLM backbones: LLAMA, Gemma, Mistral, Phi, StableLM, Qwen.
Offers LoRA fine-tuning for reduced GPU resource consumption.
Reproduces the MimiCodec audio codec.
Leverages and builds upon the Moshi and UniAudio codebases.
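
To illustrate why LoRA fine-tuning reduces resource use: LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in, so the trainable parameter count for that layer drops from d_out * d_in to r * (d_out + d_in). The sketch below is generic arithmetic only; the layer size and rank are illustrative assumptions, not values taken from RSTnet, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the two LoRA factors (d_out x r and r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes (assumed, not taken from RSTnet).
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)        # 16777216
lora = lora_params(d_out, d_in, rank)  # 65536
ratio = lora / full                    # 0.00390625

print(f"full: {full}, LoRA: {lora}, ratio: {ratio:.2%}")
# prints "full: 16777216, LoRA: 65536, ratio: 0.39%"
```

For this assumed layer, the LoRA factors hold 1/256 of the parameters that full training would update, which also shrinks the gradients and optimizer state kept for that layer.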

Maintenance & Community

The project is marked as a work in progress ("wip") and actively seeks contributions via issues, PRs, and ideas for data, codecs, or models. Contact: dcyang@se.cuhk.edu.hk. No specific community channels are listed.

Licensing & Compatibility

The README does not specify a license, so compatibility with commercial use or closed-source linking cannot be determined.

Limitations & Caveats

RSTnet is a work in progress with ongoing updates. DataPipeline details are pending, and support for additional streaming audio codecs is planned.

Explore Similar Projects

SpeechGPT-2.0-preview by OpenMOSS

LLaSA_training by zhenye234

VITA-Audio by VITA-MLLM

tts by inworld-ai

fast-voice-assistant by dsa

dia2 by nari-labs

Kimi-Audio by MoonshotAI

ultravox by fixie-ai

Orpheus-TTS by canopyai

parler-tts by huggingface

Kokoro-FastAPI by remsky

Zonos by Zyphra