StreamSpeech by ictnlp

All-in-one model for simultaneous speech tasks (ACL 2024 paper)

Created 1 year ago
1,152 stars

Top 33.5% on SourcePulse

Project Summary

StreamSpeech is an "All in One" model for offline and simultaneous speech recognition, speech translation, and speech synthesis. It targets researchers and developers working on low-latency, multi-task speech processing, and it reports state-of-the-art performance while streaming intermediate results.

How It Works

StreamSpeech uses multi-task learning within a single unified architecture to handle automatic speech recognition (ASR), speech-to-text translation (S2TT), and speech-to-speech translation (S2ST). Because the tasks share one model, it can output intermediate ASR or translation results while a real-time translation is still in progress.

Quick Start & Requirements

  • Install: Install fairseq and SimulEval from their respective directories, running pip install --editable ./ --no-build-isolation in each.
  • Prerequisites: Python 3.10, PyTorch 2.0.1, CUDA-enabled GPU.
  • Models: Download pre-trained models for specific language pairs (e.g., Fr-En, Es-En, De-En) and a Unit-based HiFi-GAN vocoder.
  • Data: Prepare test data following the SimulEval format, including wav_list.txt and target.txt. Configuration files need to be updated with local repository paths.
  • Inference: Run provided simuleval scripts for simultaneous S2ST, S2TT, and streaming ASR.
  • Demo: A Web GUI demo is available.
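
The data-preparation step can be sketched as follows. This is a minimal illustration, assuming that wav_list.txt holds one audio path per line and target.txt holds the matching reference text on the same line number; the helper name write_simuleval_lists and the example file names are hypothetical and do not come from the repository.

```python
from pathlib import Path


def write_simuleval_lists(pairs, out_dir):
    """Write wav_list.txt and target.txt with one aligned entry per line.

    `pairs` is an iterable of (wav_path, reference_text) tuples.
    Assumption: line i of target.txt is the reference for line i of wav_list.txt.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    wav_lines, target_lines = [], []
    for wav_path, reference in pairs:
        # Collapse internal whitespace so each reference stays on a single line,
        # which keeps the two files aligned.
        text = " ".join(str(reference).split())
        wav_lines.append(str(wav_path))
        target_lines.append(text)

    wav_list = out_dir / "wav_list.txt"
    target = out_dir / "target.txt"
    wav_list.write_text("\n".join(wav_lines) + "\n", encoding="utf-8")
    target.write_text("\n".join(target_lines) + "\n", encoding="utf-8")
    return wav_list, target
```

The two files it returns would then be the source and target inputs for the provided simuleval scripts; check the repository's example data for the exact layout before relying on this.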

Highlighted Details

  • Supports 8 tasks: offline ASR, S2TT, S2ST, and TTS; streaming ASR, simultaneous S2TT and S2ST, and real-time TTS.
  • Reports state-of-the-art (SOTA) performance on both offline and simultaneous speech-to-speech translation.
  • Can present intermediate ASR and translation results during simultaneous translation.
  • Offers a Web GUI demo for local browser experience.

Maintenance & Community

The project is associated with ACL 2024. Contact zhangshaolei20z@ict.ac.cn for questions.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README hard-codes paths such as /data/zhangshaolei/StreamSpeech and /data/zhangshaolei/pretrain_models, so users must replace them with paths from their local environment. Training scripts are provided, but the primary focus is inference with pre-trained models.
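
One way to adjust the hard-coded paths in bulk is sketched below. It is an assumption-laden illustration, not a script from the repository: the function names, the choice of .yaml and .json as config suffixes, and the replacement paths in the usage note are all hypothetical.

```python
from pathlib import Path


def rewrite_paths(text, mapping):
    """Replace each old path prefix in `text` with its local equivalent."""
    # Handle longer prefixes first so a short prefix never clobbers a longer one.
    for old in sorted(mapping, key=len, reverse=True):
        text = text.replace(old, mapping[old])
    return text


def rewrite_config_files(config_dir, mapping, suffixes=(".yaml", ".json")):
    """Rewrite matching config files in place and return the files that changed.

    Assumption: configs are plain-text files with the given suffixes.
    """
    changed = []
    for path in sorted(Path(config_dir).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            original = path.read_text(encoding="utf-8")
            updated = rewrite_paths(original, mapping)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
                changed.append(path)
    return changed
```

A call might look like rewrite_config_files("configs", {"/data/zhangshaolei/StreamSpeech": "/home/me/StreamSpeech", "/data/zhangshaolei/pretrain_models": "/home/me/pretrain_models"}), where the right-hand paths are placeholders. Since the files are edited in place, running it on a clean checkout makes the changes easy to review or revert.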

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days
