Speech-to-speech pipeline: an open-source, modular GPT-4o
Top 12.1% on sourcepulse
This repository provides a modular, open-source speech-to-speech pipeline, aiming to replicate GPT-4o's functionality. It's designed for researchers and developers needing a flexible system for voice conversion, translation, or summarization tasks, offering significant customization through Hugging Face Transformers models.
How It Works
The system operates as a cascaded pipeline: Voice Activity Detection (VAD), Speech-to-Text (STT), a Language Model (LM) for processing/generation, and Text-to-Speech (TTS). This modular design allows users to swap components easily, leveraging a wide array of models from the Hugging Face Hub, including Whisper variants for STT, various instruction-following models for the LM, and Parler-TTS or MeloTTS for TTS.
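The cascaded design above can be sketched as a chain of swappable callables. This is an illustrative stand-in, not the repository's actual handler API: the stage names, types, and toy implementations here are assumptions made to show the data flow.

```python
# Minimal sketch of the VAD -> STT -> LM -> TTS cascade. Each stage is a
# plain callable, so swapping in a different model is just passing a
# different function. The lambdas below are toy stand-ins, NOT the real
# Whisper/LM/Parler-TTS integrations.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpeechToSpeechPipeline:
    vad: Callable[[bytes], bytes]  # trims silence from raw audio
    stt: Callable[[bytes], str]    # e.g. a Whisper variant
    lm:  Callable[[str], str]      # e.g. an instruction-following model
    tts: Callable[[str], bytes]    # e.g. Parler-TTS or MeloTTS

    def __call__(self, audio: bytes) -> bytes:
        # The output of each stage feeds the next, as in the cascade above.
        return self.tts(self.lm(self.stt(self.vad(audio))))

# Toy stand-ins demonstrating the flow end to end.
pipeline = SpeechToSpeechPipeline(
    vad=lambda audio: audio.strip(b"\x00"),  # pretend silence is null bytes
    stt=lambda audio: "hello",               # pretend transcription
    lm=lambda text: text.upper(),            # pretend generation
    tts=lambda text: text.encode(),          # pretend synthesis
)
print(pipeline(b"\x00\x00raw-audio\x00"))
```

Because every stage shares a simple callable interface, replacing one Hub model with another only changes how that stage's callable is constructed; the pipeline itself stays untouched.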
Quick Start & Requirements
Install dependencies with `uv pip install -r requirements.txt` (or `requirements_mac.txt` for macOS); the project uses the `uv` package manager. For Melo TTS, also run `python -m unidic download`. CUDA is recommended for optimal performance.

Run the pipeline with `python s2s_pipeline.py` (server/client or local modes) and `python listen_and_play.py` (client). Docker support is available.

Highlighted Details
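The server/client split mentioned in the quick start, where one process runs the pipeline and another captures/plays audio, can be sketched with plain sockets. This is a hypothetical illustration of the pattern, not the repository's actual protocol; the host, port handling, and fake audio are assumptions.

```python
# Sketch of a server streaming synthesized audio chunks over TCP to a
# lightweight client (analogous in spirit to listen_and_play.py). A real
# client would hand the bytes to a playback device instead of buffering them.
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))          # ephemeral port for the demo
port = srv.getsockname()[1]
srv.listen(1)

def serve_audio(chunks):
    """Server side: stream each audio chunk to one client, then close."""
    conn, _ = srv.accept()
    with conn:
        for chunk in chunks:
            conn.sendall(chunk)
    srv.close()

# Fake "TTS output": two 1024-byte chunks standing in for real audio frames.
fake_audio = [b"\x00\x01" * 512, b"\x02\x03" * 512]
server = threading.Thread(target=serve_audio, args=(fake_audio,))
server.start()

# Client side: receive until the server closes the connection.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
received = b""
while True:
    chunk = cli.recv(4096)
    if not chunk:
        break
    received += chunk
cli.close()
server.join()
print(len(received))  # total bytes streamed
```

In local mode the same pipeline output would be played directly on the machine running it, so no socket hop is needed.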
Maintenance & Community
This project is actively developed by Hugging Face. Community channels are typically found via Hugging Face's main platforms.
Licensing & Compatibility
The repository itself appears to be under a permissive license, but individual models used within the pipeline will have their own licenses. Compatibility for commercial use depends on the chosen model licenses.
Limitations & Caveats
CUDA Graph capture modes are not compatible with streaming Parler-TTS. The README notes that Parler-TTS is not yet multilingual, requiring alternative TTS models for non-English languages in the automatic language switching mode.
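The multilingual caveat above implies a routing step: since Parler-TTS is English-only, an automatic language-switching setup must dispatch non-English text to a multilingual backend. A minimal sketch of that dispatch, with illustrative backend names (the actual selection logic in the repository may differ):

```python
# Hypothetical TTS backend selection for automatic language switching.
# "parler-tts" and "melo-tts" are used as stand-in identifiers; MeloTTS
# here represents any multilingual alternative.
def pick_tts_backend(lang_code: str) -> str:
    english_only = {"en"}  # languages Parler-TTS can currently handle
    if lang_code in english_only:
        return "parler-tts"  # English-only; streaming + CUDA Graph caveats apply
    return "melo-tts"        # multilingual fallback for other detected languages

print(pick_tts_backend("en"))
print(pick_tts_backend("fr"))
```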