Speech-to-speech pipeline: an open-source, modular GPT-4o
Top 12.1% on sourcepulse
This repository provides a modular, open-source speech-to-speech pipeline, aiming to replicate GPT-4o's functionality. It's designed for researchers and developers needing a flexible system for voice conversion, translation, or summarization tasks, offering significant customization through Hugging Face Transformers models.
How It Works
The system operates as a cascaded pipeline: Voice Activity Detection (VAD), Speech-to-Text (STT), a Language Model (LM) for processing/generation, and Text-to-Speech (TTS). This modular design allows users to swap components easily, leveraging a wide array of models from the Hugging Face Hub, including Whisper variants for STT, various instruction-following models for the LM, and Parler-TTS or MeloTTS for TTS.
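The cascaded design above can be sketched as a chain of swappable callables. This is an illustrative stand-in, not the repository's actual handler API: the stage names, types, and toy implementations here are assumptions made to show the data flow.

```python
# Minimal sketch of the VAD -> STT -> LM -> TTS cascade. Each stage is a
# plain callable, so swapping in a different model is just passing a
# different function. The lambdas below are toy stand-ins, NOT the real
# Whisper/LM/Parler-TTS integrations.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpeechToSpeechPipeline:
    vad: Callable[[bytes], bytes]  # trims silence from raw audio
    stt: Callable[[bytes], str]    # e.g. a Whisper variant
    lm:  Callable[[str], str]      # e.g. an instruction-following model
    tts: Callable[[str], bytes]    # e.g. Parler-TTS or MeloTTS

    def __call__(self, audio: bytes) -> bytes:
        # The output of each stage feeds the next, as in the cascade above.
        return self.tts(self.lm(self.stt(self.vad(audio))))

# Toy stand-ins demonstrating the flow end to end.
pipeline = SpeechToSpeechPipeline(
    vad=lambda audio: audio.strip(b"\x00"),  # pretend silence is null bytes
    stt=lambda audio: "hello",               # pretend transcription
    lm=lambda text: text.upper(),            # pretend generation
    tts=lambda text: text.encode(),          # pretend synthesis
)
print(pipeline(b"\x00\x00raw-audio\x00"))
```

Because every stage shares a simple callable interface, replacing one Hub model with another only changes how that stage's callable is constructed; the pipeline itself stays untouched.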
Quick Start & Requirements
Install dependencies with `uv pip install -r requirements.txt` (or `requirements_mac.txt` for macOS); the project uses the `uv` package manager. For Melo TTS, also run `python -m unidic download`. CUDA is recommended for optimal performance.

Run the pipeline with `python s2s_pipeline.py` (server/client or local modes) and `python listen_and_play.py` (client). Docker support is available.

Highlighted Details
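The server/client split mentioned in the quick start, where one process runs the pipeline and another captures/plays audio, can be sketched with plain sockets. This is a hypothetical illustration of the pattern, not the repository's actual protocol; the host, port handling, and fake audio are assumptions.

```python
# Sketch of a server streaming synthesized audio chunks over TCP to a
# lightweight client (analogous in spirit to listen_and_play.py). A real
# client would hand the bytes to a playback device instead of buffering them.
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))          # ephemeral port for the demo
port = srv.getsockname()[1]
srv.listen(1)

def serve_audio(chunks):
    """Server side: stream each audio chunk to one client, then close."""
    conn, _ = srv.accept()
    with conn:
        for chunk in chunks:
            conn.sendall(chunk)
    srv.close()

# Fake "TTS output": two 1024-byte chunks standing in for real audio frames.
fake_audio = [b"\x00\x01" * 512, b"\x02\x03" * 512]
server = threading.Thread(target=serve_audio, args=(fake_audio,))
server.start()

# Client side: receive until the server closes the connection.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
received = b""
while True:
    chunk = cli.recv(4096)
    if not chunk:
        break
    received += chunk
cli.close()
server.join()
print(len(received))  # total bytes streamed
```

In local mode the same pipeline output would be played directly on the machine running it, so no socket hop is needed.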
Maintenance & Community
This project is actively developed by Hugging Face. Community channels are typically found via Hugging Face's main platforms.
Licensing & Compatibility
The repository itself appears to be under a permissive license, but individual models used within the pipeline will have their own licenses. Compatibility for commercial use depends on the chosen model licenses.
Limitations & Caveats
CUDA Graph capture modes are not compatible with streaming Parler-TTS. The README notes that Parler-TTS is not yet multilingual, requiring alternative TTS models for non-English languages in the automatic language switching mode.
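The multilingual caveat above implies a routing step: since Parler-TTS is English-only, an automatic language-switching setup must dispatch non-English text to a multilingual backend. A minimal sketch of that dispatch, with illustrative backend names (the actual selection logic in the repository may differ):

```python
# Hypothetical TTS backend selection for automatic language switching.
# "parler-tts" and "melo-tts" are used as stand-in identifiers; MeloTTS
# here represents any multilingual alternative.
def pick_tts_backend(lang_code: str) -> str:
    english_only = {"en"}  # languages Parler-TTS can currently handle
    if lang_code in english_only:
        return "parler-tts"  # English-only; streaming + CUDA Graph caveats apply
    return "melo-tts"        # multilingual fallback for other detected languages

print(pick_tts_backend("en"))
print(pick_tts_backend("fr"))
```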