yangdongchao/RSTnet: Real-time speech-text foundation model toolkit
Summary RSTnet tackles the challenges of training real-time speech-text foundation models, providing an open-source toolkit for researchers. It offers a comprehensive framework for data processing, audio codec integration, and model pre-training and fine-tuning, aiming to accelerate the development of advanced speech-text systems.
How It Works The platform integrates data preparation, streaming audio codec models (including a reproduction of MimiCodec), and speech-text foundation models. Building on Moshi and UniAudio, it supports diverse LLM backbones (LLaMA, Gemma, Mistral, Phi, StableLM, Qwen) and enables efficient fine-tuning via LoRA or full training, facilitating rapid research.
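The peft dependency in the requirements suggests a standard adapter-style fine-tuning flow. The following is a minimal sketch of attaching LoRA adapters to one of the supported backbone families via Hugging Face peft; the checkpoint name and target modules are illustrative assumptions, not RSTnet's actual configuration:

```python
# Hedged sketch: LoRA fine-tuning setup with Hugging Face peft.
# The backbone checkpoint and target_modules below are assumptions
# for illustration, not taken from RSTnet's code.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

backbone = "Qwen/Qwen2-0.5B"  # placeholder; any supported LLM backbone
model = AutoModelForCausalLM.from_pretrained(backbone)

lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights train
```

Freezing the base model and training only the low-rank adapters is what makes fine-tuning large backbones feasible on modest GPU budgets, relative to the full-training option.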
Quick Start & Requirements
Installation requires Python 3.12 and PyTorch built against CUDA 12.1, alongside packages such as tqdm, librosa==0.9.1, matplotlib, omegaconf, einops, vector_quantize_pytorch, tensorboard, deepspeed, and peft. A CUDA-capable GPU is effectively required, and pre-training likely demands substantial compute. A technical report is available at https://github.com/yangdongchao/RSTnet/blob/main/RSTnet.pdf.
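Before installing the pinned packages, a quick environment check can confirm that the interpreter and PyTorch/CUDA versions match the stated requirements; the expected values in the comments mirror the pins above:

```python
# Sanity check for the stated requirements: Python 3.12 and a
# CUDA 12.1 build of PyTorch. Expected values come from the README.
import sys
import torch

print("Python:", sys.version.split()[0])            # expect 3.12.x
print("PyTorch:", torch.__version__)                # expect a +cu121 build
print("CUDA toolkit:", torch.version.cuda)          # expect '12.1' (None on CPU-only builds)
print("GPU available:", torch.cuda.is_available())  # pre-training needs GPUs
```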
Maintenance & Community
Marked as a work-in-progress ("wip"), the project actively solicits contributions via issues, PRs, and ideas for data, codecs, or models. Contact: dcyang@se.cuhk.edu.hk. No dedicated community channels are listed; the repository was last updated about 10 months ago and is currently flagged as inactive.
Licensing & Compatibility The license type is not specified in the README. Compatibility for commercial use or closed-source linking cannot be determined.
Limitations & Caveats RSTnet is a work-in-progress with ongoing updates. DataPipeline details are pending. Support for additional streaming audio codecs is planned.