SimulStreaming by ufal

Real-time speech-to-text and LLM translation

Created 9 months ago
498 stars

Top 62.4% on SourcePulse

View on GitHub
Project Summary

SimulStreaming provides a framework for real-time, low-latency speech-to-text (ASR) and text-to-text translation, specifically designed for long-form speech. It targets researchers and power users requiring efficient, multilingual audio stream processing, offering significant speed improvements and enabling production-ready applications by adapting offline models for streaming.

How It Works

The system integrates a Whisper-based ASR component with an LLM-based translation component (currently EuroLLM). It employs novel "simultaneous policies," such as AlignAtt and LocalAgreement, to adapt powerful offline models to streaming input: these policies manage input chunking and decide when generated output is stable enough to emit, letting high-quality foundation models run in near real time with minimal quality degradation. The architecture supports direct transcription or a cascade of ASR followed by translation, and incorporates flexible prompting and retrieval-augmented generation (RAG).
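The core idea behind LocalAgreement can be sketched briefly: a token is committed to the output only once two successive decoding passes over the growing input agree on it. The following minimal illustration is an assumption-laden sketch (the function name and list-of-tokens representation are ours, not the project's API):

```python
def local_agreement(prev_hypothesis, new_hypothesis):
    """Sketch of the LocalAgreement stability criterion.

    Given the decoder's hypotheses for two successive (growing) input
    chunks, commit only the longest common token prefix: tokens the
    model produced identically both times are considered stable enough
    to emit to the output stream; everything after the first
    disagreement is held back and may still change.
    """
    committed = []
    for old_tok, new_tok in zip(prev_hypothesis, new_hypothesis):
        if old_tok != new_tok:
            break
        committed.append(old_tok)
    return committed


# The tail ("sat" vs. "sits down") disagrees, so only the stable
# prefix is emitted:
local_agreement(["the", "cat", "sat"], ["the", "cat", "sits", "down"])
# → ["the", "cat"]
```

In a real streaming loop, the committed prefix is emitted incrementally and the uncommitted tail is re-decoded as more audio arrives; this is what trades a small amount of latency for output stability.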

Quick Start & Requirements

  • Installation:
    • ASR: pip install -r requirements_whisper.txt
    • Translation: pip install -r requirements_translate.txt
  • Prerequisites:
    • Python environment.
    • GPU with at least 10GB VRAM recommended for optimal performance with Whisper large-v3 (1.5B parameters); CPU is functional but slow.
    • For translation: Request access to and download gated LLM models (e.g., EuroLLM-9B-Instruct) from Hugging Face, then convert them to CTranslate2 format using ct2-transformers-converter.
  • Setup: Requires careful configuration of model paths and dependencies. Real-time server/client setup involves tools like arecord (Linux) or ffmpeg and netcat.
  • Links:
    • Paper: Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025
    • Poster: IWSLT 2025
    • Demo info: ELITR.eu/iwslt25
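The server/client setup above pipes microphone audio over a plain TCP socket (e.g., `arecord ... | nc <host> <port>`). As a rough illustration of what that pipeline does, here is a hypothetical Python stand-in for the client side (host, port, and chunk size are assumptions, not the project's defaults):

```python
import socket

CHUNK_SIZE = 4096  # bytes per packet; an assumed value, not the project's default


def stream_audio(host, port, audio_source):
    """Send raw audio bytes to a transcription server chunk by chunk.

    Hypothetical stand-in for the `arecord ... | nc <host> <port>`
    pipeline: `audio_source` is any iterable of byte chunks, e.g. a
    microphone capture or a file read in CHUNK_SIZE pieces. The server
    receives a continuous byte stream and segments it itself.
    """
    with socket.create_connection((host, port)) as sock:
        for chunk in audio_source:
            sock.sendall(chunk)
```

On Linux, the equivalent command-line client captures from the default microphone and forwards raw PCM to the server; the Python sketch is only meant to show that nothing more than a byte stream over TCP is involved.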

Highlighted Details

  • Multilingual Capabilities: Supports transcription from 99 source languages and translation into 35 target languages.
  • Performance: Processes audio approximately 5x faster than WhisperStreaming, and achieved state-of-the-art results in the IWSLT 2025 Simultaneous Speech Translation Shared Task.
  • Advanced Features: Integrates flexible prompting, retrieval-augmented generation (RAG), and injection of in-domain terminology.
  • Hardware Efficiency: Optimized for 1–2 GPUs, with provisions for smaller distilled models.

Maintenance & Community

Developed by authors from Charles University. User feedback is actively sought via a questionnaire to guide future development and features. No specific community channels (e.g., Discord, Slack) or public roadmaps are detailed in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The permissive MIT license allows for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

  • An offline mode for processing entire audio files with maximum quality is not yet available.
  • The CIF model required for accurate word-level truncation with Whisper large-v3 is not provided, potentially affecting the precision of the final word in a segment.
  • Setting up the LLM translation component necessitates obtaining access to gated models and performing a conversion step, adding complexity to the initial setup.
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 4
  • Star History: 45 stars in the last 30 days

Explore Similar Projects

Starred by Boris Cherny (creator of Claude Code; MTS at Anthropic), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), and 43 more.

  • whisper by openai — Speech recognition model for multilingual transcription/translation. Top 0.3%; 95k stars. Created 3 years ago; updated 2 months ago.