StreamSpeech by ictnlp

All-in-one model for simultaneous speech tasks (ACL 2024 paper)

created 1 year ago
1,123 stars

Top 34.8% on sourcepulse

Project Summary

StreamSpeech is an "All in One" model for offline and simultaneous speech recognition, speech translation, and speech synthesis. It targets researchers and developers working on low-latency, multi-task speech processing, offering state-of-the-art performance and intermediate result streaming.

How It Works

StreamSpeech uses multi-task learning within a single unified architecture to handle automatic speech recognition (ASR), speech-to-text translation (S2TT), and speech-to-speech translation (S2ST) together. Because the tasks share one model, it can output intermediate ASR or translation results while a real-time translation is still in progress.

Quick Start & Requirements

  • Install: Install fairseq and SimulEval from their respective directories by running pip install --editable ./ --no-build-isolation in each.
  • Prerequisites: Python 3.10, PyTorch 2.0.1, CUDA-enabled GPU.
  • Models: Download pre-trained models for specific language pairs (e.g., Fr-En, Es-En, De-En) and a Unit-based HiFi-GAN vocoder.
  • Data: Prepare test data following the SimulEval format, including wav_list.txt and target.txt. Configuration files need to be updated with local repository paths.
  • Inference: Run provided simuleval scripts for simultaneous S2ST, S2TT, and streaming ASR.
  • Demo: A Web GUI demo is available.
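
The data step above can be sketched as follows. This is a minimal illustration, assuming the layout in which line i of wav_list.txt is the path to one audio file and line i of target.txt is its reference text; the helper name write_simuleval_lists and the sample paths are hypothetical and not part of the repository.

```python
from pathlib import Path


def write_simuleval_lists(pairs, out_dir):
    """Write wav_list.txt and target.txt so that line i of each file
    refers to the same utterance.

    pairs: iterable of (wav_path, reference_text) tuples.
    Returns the number of utterances written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    wav_lines, target_lines = [], []
    for wav_path, text in pairs:
        wav_lines.append(str(wav_path))
        # Collapse whitespace so every reference stays on a single line.
        target_lines.append(" ".join(text.split()))

    # One entry per line, each terminated by a newline.
    (out_dir / "wav_list.txt").write_text(
        "".join(line + "\n" for line in wav_lines), encoding="utf-8"
    )
    (out_dir / "target.txt").write_text(
        "".join(line + "\n" for line in target_lines), encoding="utf-8"
    )
    return len(wav_lines)


if __name__ == "__main__":
    # Hypothetical sample data; replace with real audio paths and references.
    samples = [
        ("/path/to/audio/0001.wav", "hello world"),
        ("/path/to/audio/0002.wav", "good morning"),
    ]
    write_simuleval_lists(samples, "example_data")
```

Keeping the two files aligned line by line matters because the evaluation pairs each audio file with its reference by position.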

Highlighted Details

  • Supports 8 tasks: offline ASR, S2TT, S2ST, and TTS; simultaneous (streaming) ASR, S2TT, S2ST, and real-time TTS.
  • Achieves SOTA performance on both offline and simultaneous speech-to-speech translation.
  • Can present intermediate ASR and translation results during simultaneous translation.
  • Offers a Web GUI demo for local browser experience.

Maintenance & Community

The project is associated with ACL 2024. Contact zhangshaolei20z@ict.ac.cn for questions.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README uses hard-coded paths such as /data/zhangshaolei/StreamSpeech and /data/zhangshaolei/pretrain_models, so users must replace them with paths from their local environment. Training scripts are provided, but the primary focus is on inference with pre-trained models.
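
One way to handle the path adjustment is a small search-and-replace script. The sketch below is illustrative only: the two old paths are the ones quoted above, but the function name rewrite_paths, the choice of file suffixes, and the example directory names are assumptions, so check which files in your checkout actually contain these paths before running anything like it.

```python
from pathlib import Path

# Paths quoted in the README.
OLD_ROOT = "/data/zhangshaolei/StreamSpeech"
OLD_MODELS = "/data/zhangshaolei/pretrain_models"


def rewrite_paths(config_dir, new_root, new_models, suffixes=(".yaml", ".json")):
    """Replace the README's hard-coded paths in files under config_dir.

    Only files whose suffix is in `suffixes` are touched (an assumption
    about where the paths live). Returns the list of files that changed.
    """
    changed = []
    for path in sorted(Path(config_dir).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8")
        updated = text.replace(OLD_ROOT, str(new_root))
        updated = updated.replace(OLD_MODELS, str(new_models))
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Hypothetical local locations; substitute your own.
    for p in rewrite_paths("configs", "/home/me/StreamSpeech", "/home/me/pretrain_models"):
        print("updated", p)
```

Running it on a copy of the configuration directory first makes it easy to diff the result before overwriting anything.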

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 61 stars in the last 90 days
