speech-to-speech by huggingface

Speech-to-speech pipeline: an open-source, modular take on GPT-4o

created 1 year ago
4,123 stars

Top 12.1% on sourcepulse

Project Summary

This repository provides a modular, open-source speech-to-speech pipeline, aiming to replicate GPT-4o's functionality. It's designed for researchers and developers needing a flexible system for voice conversion, translation, or summarization tasks, offering significant customization through Hugging Face Transformers models.

How It Works

The system operates as a cascaded pipeline: Voice Activity Detection (VAD), Speech-to-Text (STT), a Language Model (LM) for processing/generation, and Text-to-Speech (TTS). This modular design allows users to swap components easily, leveraging a wide array of models from the Hugging Face Hub, including Whisper variants for STT, various instruction-following models for the LM, and Parler-TTS or MeloTTS for TTS.
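The cascade described above can be illustrated with a minimal sketch. Everything here is hypothetical: the class name, the stage signatures, and the stub functions are stand-ins for illustration, not the repository's actual classes or API. The point is only that each stage is a swappable callable and that silence short-circuits the chain.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

# Hypothetical stage signatures (assumptions, not the project's real interfaces).
VAD = Callable[[bytes], Optional[bytes]]  # audio chunk -> speech segment, or None for silence
STT = Callable[[bytes], str]              # speech segment -> transcript
LM = Callable[[str], str]                 # transcript -> reply text
TTS = Callable[[str], bytes]              # reply text -> synthesized audio


@dataclass(frozen=True)
class CascadedPipeline:
    """Chains VAD -> STT -> LM -> TTS; any stage can be replaced independently."""
    vad: VAD
    stt: STT
    lm: LM
    tts: TTS

    def process(self, chunk: bytes) -> Optional[bytes]:
        speech = self.vad(chunk)
        if speech is None:
            return None  # no speech detected, so later stages are skipped
        transcript = self.stt(speech)
        reply = self.lm(transcript)
        return self.tts(reply)


# Stub stages so the sketch runs without any models installed.
def stub_vad(chunk: bytes) -> Optional[bytes]:
    return chunk if any(chunk) else None  # all-zero bytes count as silence

def stub_stt(speech: bytes) -> str:
    return speech.decode("utf-8")

def stub_lm(prompt: str) -> str:
    return f"You said: {prompt}"

def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(stub_vad, stub_stt, stub_lm, stub_tts)

# Swapping one component leaves the other three untouched.
shouting = replace(pipeline, lm=str.upper)
```

In a real setup each stub would wrap a model, for example a Whisper variant behind the STT callable, while the surrounding wiring stays the same.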

Quick Start & Requirements

  • Install: uv pip install -r requirements.txt (or requirements_mac.txt for macOS).
  • Prerequisites: Python and the uv package manager. For MeloTTS, also run python -m unidic download. CUDA is recommended for best performance.
  • Usage: Run the pipeline with python s2s_pipeline.py (server/client or local mode); in server/client mode, run python listen_and_play.py on the client. Docker support is available.
  • Docs: https://github.com/huggingface/speech-to-speech

Highlighted Details

  • Supports multiple STT (Whisper, Paraformer, FunASR), LM, and TTS (Parler-TTS, MeloTTS, ChatTTS) backends.
  • Offers server/client and local execution modes.
  • Includes automatic language detection and switching capabilities.
  • Supports multi-language conversations (English, French, Spanish, Chinese, Japanese, Korean).
  • Recommends Torch Compile for performance with CUDA.

Maintenance & Community

This project is actively developed by Hugging Face. Community channels are typically found via Hugging Face's main platforms.

Licensing & Compatibility

The repository itself appears to be under a permissive license, but each model used in the pipeline carries its own license. Whether commercial use is permitted therefore depends on the licenses of the models you choose.

Limitations & Caveats

CUDA Graph capture modes are not compatible with streaming Parler-TTS. The README also notes that Parler-TTS is not yet multilingual, so automatic language switching needs a different TTS model for non-English languages.
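One way to handle the multilingual caveat is to fall back to another TTS backend whenever the detected language is not English. The sketch below is illustrative only: the backend labels, the function name, and the choice of MeloTTS as the fallback are assumptions, not the project's command-line flags or internal logic.

```python
# Hypothetical backend labels (assumptions, not the project's actual option values).
ENGLISH_ONLY_TTS = "parler"
MULTILINGUAL_TTS = "melo"


def pick_tts_backend(language_code: str, preferred: str = ENGLISH_ONLY_TTS) -> str:
    """Return the TTS backend to use for a detected language.

    Keeps the preferred backend unless it is the English-only one and the
    detected language is something other than English.
    """
    if preferred == ENGLISH_ONLY_TTS and language_code.lower() != "en":
        return MULTILINGUAL_TTS
    return preferred


# Example: a conversation that switches from English to French and back.
backends = [pick_tts_backend(code) for code in ("en", "fr", "en")]
```

Choosing the backend per utterance keeps the faster or preferred model for English while still producing audio when the speaker switches language.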

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

141 stars in the last 90 days
