StreamSpeech by ictnlp

All-in-one model for simultaneous speech tasks (ACL 2024 paper)

Created 1 year ago

1,223 stars

Top 32.1% on SourcePulse

Project Summary

StreamSpeech is an "All in One" model for offline and simultaneous speech recognition, speech translation, and speech synthesis. It targets researchers and developers working on low-latency, multi-task speech processing. It can stream intermediate results during translation and reports state-of-the-art performance on speech-to-speech translation.

How It Works

StreamSpeech uses multi-task learning within a single architecture to handle automatic speech recognition (ASR), speech-to-text translation (S2TT), and speech-to-speech translation (S2ST) jointly. Because the tasks share one model, it can output intermediate ASR or translation results while real-time translation is still in progress.
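As a rough illustration of the multi-task idea only, and not StreamSpeech's actual training objective, a shared model is commonly trained by minimizing a weighted sum of per-task losses. The task names follow the text above; the combine_losses helper, the weights, and the loss values are hypothetical.

```python
def combine_losses(task_losses, task_weights=None):
    """Return the weighted sum of per-task losses (a generic multi-task objective).

    Tasks without an explicit weight default to a weight of 1.0.
    """
    if task_weights is None:
        task_weights = {}
    return sum(
        task_weights.get(task, 1.0) * loss
        for task, loss in task_losses.items()
    )


# Made-up loss values, purely for illustration (not measured results).
losses = {"asr": 0.5, "s2tt": 1.0, "s2st": 2.0}
total = combine_losses(losses, {"s2st": 2.0})  # up-weight the S2ST task
```

In a real training loop the per-task values would be tensors produced by the shared model, and the weights would be tuned per task.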

Quick Start & Requirements

Install: fairseq and SimulEval must both be installed from their respective directories (pip install --editable ./ --no-build-isolation).
Prerequisites: Python 3.10, PyTorch 2.0.1, CUDA-enabled GPU.
Models: Download pre-trained models for specific language pairs (e.g., Fr-En, Es-En, De-En) and a Unit-based HiFi-GAN vocoder.
Data: Prepare test data in the SimulEval format, including wav_list.txt and target.txt, and update the configuration files with local repository paths.
Inference: Run provided simuleval scripts for simultaneous S2ST, S2TT, and streaming ASR.
Demo: A Web GUI demo is available.
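The data-preparation step above can be sketched as follows. This is a minimal sketch that assumes wav_list.txt holds one audio path per line and target.txt holds the matching reference text on the same line number; the write_simuleval_inputs helper and the sample paths are hypothetical and not part of the repository.

```python
from pathlib import Path


def write_simuleval_inputs(samples, out_dir):
    """Write wav_list.txt and target.txt with one aligned entry per line.

    `samples` is a list of (wav_path, reference_text) pairs.
    Assumption: line i of target.txt is the reference for line i of wav_list.txt.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wav_lines = []
    target_lines = []
    for wav_path, reference in samples:
        wav_lines.append(str(wav_path))
        # Collapse whitespace so each reference stays on a single line.
        target_lines.append(" ".join(reference.split()))
    wav_list = out_dir / "wav_list.txt"
    target = out_dir / "target.txt"
    wav_list.write_text("".join(line + "\n" for line in wav_lines), encoding="utf-8")
    target.write_text("".join(line + "\n" for line in target_lines), encoding="utf-8")
    return wav_list, target
```

Calling write_simuleval_inputs([("clips/0001.wav", "hello world")], "my_test_set") would create both files in my_test_set; the resulting paths are what the inference scripts would then be pointed at.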

Highlighted Details

Supports 8 tasks: offline ASR, S2TT, S2ST, and TTS; streaming ASR, simultaneous S2TT and S2ST, and real-time TTS.
Reports state-of-the-art (SOTA) performance on both offline and simultaneous speech-to-speech translation.
Can present intermediate ASR and translation results during simultaneous translation.
Offers a Web GUI demo that runs locally in the browser.

Maintenance & Community

The project accompanies an ACL 2024 paper. Contact zhangshaolei20z@ict.ac.cn with questions.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README uses author-specific paths such as /data/zhangshaolei/StreamSpeech and /data/zhangshaolei/pretrain_models, so users must replace them with paths from their local environment. Training scripts are provided, but the primary focus is on inference with pre-trained models.
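One way to handle the path adjustment is a small search-and-replace script. This is a sketch under assumptions: the rewrite_paths helper is hypothetical, and the "*.yaml" pattern is a guess at how the configuration files are named, so it should be checked against the actual files before use.

```python
from pathlib import Path

OLD_ROOT = "/data/zhangshaolei/StreamSpeech"  # path quoted from the README


def rewrite_paths(config_dir, new_root, old_root=OLD_ROOT, pattern="*.yaml"):
    """Replace old_root with new_root in matching files; return the files changed.

    Assumption: configs are text files matching `pattern` under config_dir.
    """
    changed = []
    for path in sorted(Path(config_dir).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        if old_root in text:
            # Rewrite the file in place with the local repository root.
            path.write_text(text.replace(old_root, new_root), encoding="utf-8")
            changed.append(path)
    return changed
```

The pretrain_models path has a different prefix, so it would need a second call with old_root="/data/zhangshaolei/pretrain_models". Because the files are rewritten in place, running this on a clean checkout makes it easy to review the changes with a diff.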

Explore Similar Projects

LLaMA-Omni2 by ictnlp

S.A.T.U.R.D.A.Y by GRVYDEV

babelfish.ai by supabase-community

hibiki by kyutai-labs

xtts-webui by daswer123

sherpa-ncnn by k2-fsa

voice-pro by abus-aikorea

whisper-asr-webservice by ahmetoner

piper by rhasspy

seamless_communication by facebookresearch

wenet by wenet-e2e

CosyVoice by FunAudioLLM