All-in-one model for simultaneous speech tasks (ACL 2024 paper)
StreamSpeech is an "All in One" model for offline and simultaneous speech recognition, speech translation, and speech synthesis. It targets researchers and developers working on low-latency, multi-task speech processing, offering state-of-the-art performance and intermediate result streaming.
How It Works
StreamSpeech employs multi-task learning within a unified architecture to handle ASR, S2TT, and S2ST simultaneously. Because a single model serves all tasks, it can surface intermediate ASR or translation results while real-time translation is still in progress, which improves the communication experience in low-latency settings.
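To make the "intermediate results" idea concrete, here is a minimal sketch, not StreamSpeech's actual API: a streaming loop that extends a running ASR hypothesis and a (possibly lagging) translation hypothesis as each audio chunk arrives. The chunk dictionaries and word labels are stand-ins for real encoder/decoder computation.

```python
# Hypothetical sketch of simultaneous multi-task decoding (illustrative only;
# StreamSpeech's real interface differs). Each incoming chunk extends both the
# ASR prefix and, when the translation policy allows, the S2TT prefix.
from typing import Iterator, Tuple

def simultaneous_decode(chunks) -> Iterator[Tuple[str, str]]:
    asr_prefix, s2tt_prefix = [], []
    for chunk in chunks:
        # A real model would encode the new audio chunk and extend both
        # decoders; here the chunk carries precomputed "labels" instead.
        asr_prefix.append(chunk["src_word"])
        if chunk.get("tgt_word"):  # translation may lag behind the source
            s2tt_prefix.append(chunk["tgt_word"])
        # Emit intermediate hypotheses after every chunk.
        yield " ".join(asr_prefix), " ".join(s2tt_prefix)

chunks = [
    {"src_word": "hello", "tgt_word": None},
    {"src_word": "world", "tgt_word": "bonjour"},
    {"src_word": "again", "tgt_word": "le monde"},
]
for asr, s2tt in simultaneous_decode(chunks):
    print(f"ASR: {asr!r} | S2TT: {s2tt!r}")
```

The key property illustrated: partial ASR output is available immediately, while translation output catches up as the policy commits to target words.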
Quick Start & Requirements
Requires fairseq and SimulEval to be installed from their respective directories (`pip install --editable ./ --no-build-isolation`). Input audio and reference text are supplied via `wav_list.txt` and `target.txt`, and the configuration files must be updated with local repository paths. `simuleval` scripts are provided for simultaneous S2ST, S2TT, and streaming ASR.
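A minimal sketch of preparing the two input manifests mentioned above; the file names come from the README, but the audio paths and reference strings are placeholders.

```shell
# Hypothetical example: build the manifests the simuleval scripts consume.
# One audio path per line in wav_list.txt, one reference per line in target.txt.
printf '%s\n' example/wavs/test1.wav example/wavs/test2.wav > wav_list.txt
printf '%s\n' "bonjour le monde" "merci beaucoup" > target.txt
# The two files must stay line-aligned (same number of entries).
wc -l wav_list.txt target.txt
```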
Maintenance & Community
The project is associated with ACL 2024. Contact zhangshaolei20z@ict.ac.cn for questions.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README hard-codes paths such as `/data/zhangshaolei/StreamSpeech` and `/data/zhangshaolei/pretrain_models`, so users must adjust them to their local environments. Training scripts are provided, but the primary focus is on inference with pre-trained models.
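One way to localize those hard-coded paths is a bulk rewrite over the configuration files. The config file name (`config.yaml`), its contents, and the `$HOME/work` target directory below are all assumptions for illustration; the `/data/zhangshaolei` prefix is the one the README actually uses.

```shell
# Create a stand-in config containing the README's hard-coded prefix
# (real StreamSpeech configs have different keys and file names).
echo "model_path: /data/zhangshaolei/pretrain_models/streamspeech.pt" > config.yaml
# Rewrite the absolute prefix to a local directory of your choosing.
sed -i "s#/data/zhangshaolei#$HOME/work#g" config.yaml
cat config.yaml
```

Using `#` as the `sed` delimiter avoids having to escape the slashes in the paths.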