RSTnet by yangdongchao

Real-time speech-text foundation model toolkit

Created 1 year ago
251 stars

Top 99.9% on SourcePulse

Project Summary

Summary

RSTnet tackles the challenges in training real-time speech-text foundation models, providing an open-source toolkit for researchers. It offers a comprehensive framework for data processing, audio codec integration, and model pre-training/fine-tuning, aiming to accelerate the development of advanced speech-text systems.

How It Works

The platform integrates data preparation, streaming audio codec models (including a reproduction of MimiCodec), and speech-text foundation models. Building on Moshi and UniAudio, it supports diverse LLM backbones (LLAMA, Gemma, Mistral, Phi, StableLM, Qwen) and allows either LoRA fine-tuning or full training.

Quick Start & Requirements

Installation requires Python 3.12 and PyTorch with CUDA 12.1 support, alongside packages like tqdm, librosa==0.9.1, matplotlib, omegaconf, einops, vector_quantize_pytorch, tensorboard, deepspeed, and peft. GPU acceleration is implied, with significant resources potentially needed for pre-training. A technical report is available at https://github.com/yangdongchao/RSTnet/blob/main/RSTnet.pdf.
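
A small pre-flight script can show which of the listed requirements are already satisfied before attempting a training run. The sketch below is not part of RSTnet: the function name `check_environment` is hypothetical, and it assumes each listed package is importable under the name given in the summary (with `torch` standing in for PyTorch). It does not verify the CUDA version.

```python
import sys
from importlib import metadata, util

# Names taken from the requirements listed above; "torch" stands in for PyTorch.
# Assumption: each package is importable under the same name it is installed as.
REQUIRED = [
    "torch", "tqdm", "librosa", "matplotlib", "omegaconf", "einops",
    "vector_quantize_pytorch", "tensorboard", "deepspeed", "peft",
]
# Only librosa has a pinned version in the summary.
PINNED = {"librosa": "0.9.1"}


def check_environment(required=REQUIRED, pinned=PINNED, python=(3, 12)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    found = tuple(sys.version_info[:2])
    if found != tuple(python):
        problems.append(f"Python {python[0]}.{python[1]} expected, found {found[0]}.{found[1]}")
    for name in required:
        try:
            spec = util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            problems.append(f"missing package: {name}")
            continue
        wanted = pinned.get(name)
        if wanted is not None:
            try:
                installed = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed = None
            if installed != wanted:
                problems.append(f"{name}=={wanted} expected, found {installed}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks consistent with the listed requirements"]:
        print(line)
```

Returning a list of problems instead of raising on the first one makes it easier to fix everything in a single pass.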

Highlighted Details

  • Supports pre-training of speech-text foundation models (MLLM_v2).
  • Accommodates diverse LLM backbones: LLAMA, Gemma, Mistral, Phi, StableLM, Qwen.
  • Offers LoRA fine-tuning for reduced GPU resource consumption.
  • Reproduces the MimiCodec audio codec.
  • Leverages and builds upon the Moshi and UniAudio codebases.
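
To give a rough sense of why LoRA reduces resource use, the sketch below counts trainable parameters for a single weight matrix under full training versus a rank-r LoRA adapter. The layer size (4096 x 4096) and rank (8) are hypothetical values chosen for illustration, not RSTnet's configuration, and the function names are made up for this example.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters in the two low-rank factors:
    B with shape (d_out, rank) and A with shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Hypothetical projection layer and adapter rank, for illustration only.
d_model, rank = 4096, 8

full = full_params(d_model, d_model)
lora = lora_trainable_params(d_model, d_model, rank)

print(f"full training : {full:,} trainable parameters")
print(f"LoRA (r={rank})    : {lora:,} trainable parameters")
print(f"ratio         : {lora / full:.2%}")
```

The frozen base weights still occupy GPU memory under LoRA. The savings come mainly from the smaller gradients and optimizer state for the trainable factors.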

Maintenance & Community

Marked as a work-in-progress ("wip"), the project actively seeks contributions via issues, PRs, and ideas for data, codecs, or models. Contact: dcyang@se.cuhk.edu.hk. No specific community channels are listed.

Licensing & Compatibility

The license type is not specified in the README. Compatibility for commercial use or closed-source linking cannot be determined.

Limitations & Caveats

RSTnet is a work-in-progress with ongoing updates. DataPipeline details are pending. Support for additional streaming audio codecs is planned.

Health Check
Last Commit: 10 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

ultravox by fixie-ai

0.1%
4k
Multimodal LLM for real-time voice interactions
Created 1 year ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michael Han (Cofounder of Unsloth), and 1 more.

Orpheus-TTS by canopyai

0.3%
6k
Open-source TTS for human-sounding speech, built on Llama-3b
Created 10 months ago
Updated 1 month ago