speech-to-speech by huggingface

Speech-to-speech pipeline for an open-source, modular GPT-4o

Created 1 year ago
4,182 stars

Top 11.8% on SourcePulse

Project Summary

This repository provides a modular, open-source speech-to-speech pipeline that aims to replicate GPT-4o's voice functionality. It is designed for researchers and developers who need a flexible system for voice conversation, translation, or summarization tasks, and it offers significant customization through Hugging Face Transformers models.

How It Works

The system operates as a cascaded pipeline of four stages: Voice Activity Detection (VAD), Speech-to-Text (STT), a Language Model (LM) that processes the transcript and generates a reply, and Text-to-Speech (TTS). Each stage feeds the next, so users can swap any component for another model from the Hugging Face Hub, including Whisper variants for STT, various instruction-following models for the LM, and Parler-TTS or MeloTTS for TTS.
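To make the cascade concrete, here is a minimal sketch of how four swappable stages can be chained. Every name in it (CascadedPipeline, energy_vad, fake_stt, echo_lm, fake_tts) is a hypothetical stand-in and not the repository's own code, and the energy threshold is a toy substitute for a real VAD model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# All names below are hypothetical stand-ins, not the repository's classes.
Audio = list[float]  # mono samples in [-1.0, 1.0]


@dataclass
class CascadedPipeline:
    vad: Callable[[Audio], Optional[Audio]]  # returns speech, or None if silent
    stt: Callable[[Audio], str]              # speech -> transcript
    lm: Callable[[str], str]                 # transcript -> reply text
    tts: Callable[[str], Audio]              # reply text -> speech

    def run(self, chunk: Audio) -> Optional[Audio]:
        speech = self.vad(chunk)
        if speech is None:  # no speech detected: skip the later stages
            return None
        return self.tts(self.lm(self.stt(speech)))


# Toy stages so the sketch runs without downloading any model.
def energy_vad(chunk: Audio, threshold: float = 0.01) -> Optional[Audio]:
    # Mean squared amplitude as a crude speech detector (assumption).
    energy = sum(s * s for s in chunk) / max(len(chunk), 1)
    return chunk if energy >= threshold else None


def fake_stt(speech: Audio) -> str:
    return "hello"


def echo_lm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> Audio:
    return [0.0] * len(text)  # one silent sample per character


pipeline = CascadedPipeline(energy_vad, fake_stt, echo_lm, fake_tts)
```

Because each stage is just a callable with a fixed input and output type, replacing one model means passing a different function to the constructor; the other three stages stay untouched.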

Quick Start & Requirements

  • Install: uv pip install -r requirements.txt (or requirements_mac.txt for macOS).
  • Prerequisites: Python and the uv package manager. For MeloTTS, also run python -m unidic download. CUDA is recommended for optimal performance.
  • Usage: Run the pipeline with python s2s_pipeline.py (server or local mode). In server/client mode, run python listen_and_play.py on the client. Docker support is available.
  • Docs: https://github.com/huggingface/speech-to-speech

Highlighted Details

  • Supports multiple STT (Whisper, Paraformer via FunASR), LM, and TTS (Parler-TTS, MeloTTS, ChatTTS) backends.
  • Offers server/client and local execution modes.
  • Includes automatic language detection and switching capabilities.
  • Supports multi-language conversations (English, French, Spanish, Chinese, Japanese, Korean).
  • Recommends torch.compile for better performance on CUDA.

Maintenance & Community

This project is developed by Hugging Face. Community channels are available through Hugging Face's main platforms.

Licensing & Compatibility

The repository itself appears to be under a permissive license, but individual models used within the pipeline will have their own licenses. Compatibility for commercial use depends on the chosen model licenses.

Limitations & Caveats

torch.compile modes that capture CUDA Graphs (such as reduce-overhead and max-autotune) are not compatible with streaming Parler-TTS. The README also notes that Parler-TTS is not yet multilingual, so automatic language switching requires an alternative TTS model for non-English languages.
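The multilingual caveat amounts to a routing rule: keep the preferred TTS for English and fall back to a multilingual one otherwise. The sketch below illustrates that rule only. The function name, the backend labels, and the choice of MeloTTS as the fallback are assumptions for illustration, not the repository's configuration or API.

```python
# Hypothetical labels for illustration; these are not real API identifiers.
ENGLISH_ONLY_TTS = "parler"
MULTILINGUAL_TTS = "melo"  # assumed fallback for non-English replies

# ISO 639-1 codes for the languages listed under Highlighted Details.
SUPPORTED_LANGUAGES = {"en", "fr", "es", "zh", "ja", "ko"}


def choose_tts_backend(language_code: str, preferred: str = ENGLISH_ONLY_TTS) -> str:
    """Return the TTS backend label to use for a detected language."""
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language_code!r}")
    # An English-only backend cannot speak other languages, so reroute.
    if preferred == ENGLISH_ONLY_TTS and language_code != "en":
        return MULTILINGUAL_TTS
    return preferred
```

With these assumptions, choose_tts_backend("en") keeps the preferred backend, while choose_tts_backend("fr") switches to the multilingual fallback.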

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

43 stars in the last 30 days
