SimulStreaming by ufal

Real-time speech-to-text and LLM translation

Created 9 months ago
498 stars

Top 62.4% on SourcePulse

View on GitHub
Project Summary

SimulStreaming provides a framework for real-time, low-latency speech-to-text (ASR) and text-to-text translation, specifically designed for long-form speech. It targets researchers and power users requiring efficient, multilingual audio stream processing, offering significant speed improvements and enabling production-ready applications by adapting offline models for streaming.

How It Works

The system integrates a Whisper-based ASR component with an LLM-based translation component (currently EuroLLM). It employs novel "simultaneous policies," such as AlignAtt and LocalAgreement, to adapt powerful offline models to streaming input: these policies manage input chunking and decide when generated output is stable enough to emit, letting high-quality foundation models run in near real time with minimal quality degradation. The architecture supports direct transcription or a cascade of ASR followed by translation, and incorporates flexible prompting and retrieval-augmented generation (RAG).
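The core idea behind LocalAgreement can be sketched briefly: a token is committed to the output only once two successive decoding passes over the growing input agree on it. The following minimal illustration is an assumption-laden sketch (the function name and list-of-tokens representation are ours, not the project's API):

```python
def local_agreement(prev_hypothesis, new_hypothesis):
    """Sketch of the LocalAgreement stability criterion.

    Given the decoder's hypotheses for two successive (growing) input
    chunks, commit only the longest common token prefix: tokens the
    model produced identically both times are considered stable enough
    to emit to the output stream; everything after the first
    disagreement is held back and may still change.
    """
    committed = []
    for old_tok, new_tok in zip(prev_hypothesis, new_hypothesis):
        if old_tok != new_tok:
            break
        committed.append(old_tok)
    return committed


# The tail ("sat" vs. "sits down") disagrees, so only the stable
# prefix is emitted:
local_agreement(["the", "cat", "sat"], ["the", "cat", "sits", "down"])
# → ["the", "cat"]
```

In a real streaming loop, the committed prefix is emitted incrementally and the uncommitted tail is re-decoded as more audio arrives; this is what trades a small amount of latency for output stability.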

Quick Start & Requirements

  • Installation:
    • ASR: pip install -r requirements_whisper.txt
    • Translation: pip install -r requirements_translate.txt
  • Prerequisites:
    • Python environment.
    • GPU with at least 10GB VRAM recommended for optimal performance with Whisper large-v3 (1.5B parameters); CPU is functional but slow.
    • For translation: Request access to and download gated LLM models (e.g., EuroLLM-9B-Instruct) from Hugging Face, then convert them to CTranslate2 format using ct2-transformers-converter.
  • Setup: Requires careful configuration of model paths and dependencies. Real-time server/client setup involves tools like arecord (Linux) or ffmpeg and netcat.
  • Links:
    • Paper: Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025
    • Poster: IWSLT 2025
    • Demo info: ELITR.eu/iwslt25
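The server/client setup above pipes microphone audio over a plain TCP socket (e.g., `arecord ... | nc <host> <port>`). As a rough illustration of what that pipeline does, here is a hypothetical Python stand-in for the client side (host, port, and chunk size are assumptions, not the project's defaults):

```python
import socket

CHUNK_SIZE = 4096  # bytes per packet; an assumed value, not the project's default


def stream_audio(host, port, audio_source):
    """Send raw audio bytes to a transcription server chunk by chunk.

    Hypothetical stand-in for the `arecord ... | nc <host> <port>`
    pipeline: `audio_source` is any iterable of byte chunks, e.g. a
    microphone capture or a file read in CHUNK_SIZE pieces. The server
    receives a continuous byte stream and segments it itself.
    """
    with socket.create_connection((host, port)) as sock:
        for chunk in audio_source:
            sock.sendall(chunk)
```

On Linux, the equivalent command-line client captures from the default microphone and forwards raw PCM to the server; the Python sketch is only meant to show that nothing more than a byte stream over TCP is involved.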

Highlighted Details

  • Multilingual Capabilities: Supports transcription from 99 source languages and translation into 35 target languages.
  • Performance: Processes audio approximately 5x faster than WhisperStreaming, and achieved state-of-the-art results in the IWSLT 2025 Simultaneous Speech Translation Shared Task.
  • Advanced Features: Integrates flexible prompting, retrieval-augmented generation (RAG), and injection of in-domain terminology.
  • Hardware Efficiency: Optimized for 1–2 GPUs, with provisions for smaller distilled models.

Maintenance & Community

Developed by authors from Charles University. User feedback is actively sought via a questionnaire to guide future development and features. No specific community channels (e.g., Discord, Slack) or public roadmaps are detailed in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The permissive MIT license allows for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

  • An offline mode for processing entire audio files with maximum quality is not yet available.
  • The CIF model required for accurate word-level truncation with Whisper large-v3 is not provided, potentially affecting the precision of the final word in a segment.
  • Setting up the LLM translation component necessitates obtaining access to gated models and performing a conversion step, adding complexity to the initial setup.
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 4
  • Star History: 45 stars in the last 30 days

Explore Similar Projects

Starred by Boris Cherny (creator of Claude Code; MTS at Anthropic), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), and 43 more.

  • whisper by openai — Speech recognition model for multilingual transcription/translation. Top 0.3%; 95k stars. Created 3 years ago; updated 2 months ago.