Speech interaction system integrating ASR, LLM, and TTS
This project provides a sequential speech interaction system integrating Automatic Speech Recognition (ASR), Large Language Models (LLM), and Text-to-Speech (TTS). It targets developers and researchers building voice-enabled applications, offering a modular framework with multiple model options for flexibility.
How It Works
The system chains together distinct open-source models: SenseVoice for ASR, Qwen2.5 variants for LLM, and CosyVoice, Edge-TTS, or pyttsx3 for TTS. This modular approach allows users to select and swap components based on performance, resource, and quality requirements. Recent updates include voiceprint recognition using CAM++, custom wake-word detection via pinyin matching, and dialogue history memory.
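The chaining described above can be sketched as a minimal pipeline in which each stage is a plain callable, so any component can be swapped out. This is an illustrative sketch, not the project's actual API: the stand-in lambdas below take the place of SenseVoice, Qwen2.5, and a TTS backend, and the `VoicePipeline` name and dialogue-history handling are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Each stage is just a callable, so models can be swapped independently.
ASRFunc = Callable[[bytes], str]                        # audio -> text
LLMFunc = Callable[[str, List[Tuple[str, str]]], str]   # text + history -> reply
TTSFunc = Callable[[str], bytes]                        # reply -> audio

@dataclass
class VoicePipeline:
    asr: ASRFunc
    llm: LLMFunc
    tts: TTSFunc
    history: List[Tuple[str, str]] = None  # dialogue memory across turns

    def __post_init__(self):
        if self.history is None:
            self.history = []

    def step(self, audio: bytes) -> bytes:
        text = self.asr(audio)                # speech -> text
        reply = self.llm(text, self.history)  # text -> response
        self.history.append((text, reply))    # remember the turn
        return self.tts(reply)                # response -> speech

# Dummy components standing in for the real models:
pipe = VoicePipeline(
    asr=lambda audio: "hello",
    llm=lambda text, hist: f"echo: {text}",
    tts=lambda reply: reply.encode("utf-8"),
)
print(pipe.step(b"\x00\x01"))  # b'echo: hello'
```

Because the stages only agree on input/output types, replacing CosyVoice with Edge-TTS or pyttsx3 amounts to passing a different `tts` callable.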
Quick Start & Requirements
Create a conda environment (conda create -n chatAudio python=3.10), activate it (conda activate chatAudio), and install PyTorch with CUDA support (e.g., pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118). Install the remaining dependencies via pip install -r requirements.txt or the specific packages listed in the README, then run the real-time demo with python 13_SenceVoice_QWen2.5_edgeTTS_realTime.py. Note that pynini may require conda install -c conda-forge pynini=2.1.6.
Maintenance & Community
The project is actively maintained, with recent additions landing in late 2024. Links to Bilibili demo videos are provided as visual examples.
Licensing & Compatibility
The README does not explicitly state the license for the project itself or its constituent models. Users should verify licensing for each component (SenseVoice, Qwen2.5, CosyVoice, Edge-TTS, pyttsx3) before commercial use.
Limitations & Caveats
CosyVoice is noted as having slow inference, which impacts real-time performance. Some dependencies, such as pynini, may require specific installation methods (e.g., via conda). The project's overall licensing status requires clarification before commercial use.
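Given the speed caveat, it can help to measure synthesis latency when choosing a backend. The helper below is a hypothetical sketch (the function name and the 0.5 s budget are illustrative, not defined by the project); it wraps any text-to-audio callable and flags calls that exceed a latency budget.

```python
import time

def timed_tts(tts_func, text, budget_s=0.5):
    """Call a TTS backend and flag synthesis that exceeds a latency budget.

    tts_func is any callable mapping text -> audio bytes; budget_s is an
    illustrative real-time threshold, not a value the project defines.
    """
    start = time.perf_counter()
    audio = tts_func(text)
    elapsed = time.perf_counter() - start
    if elapsed > budget_s:
        print(f"warning: TTS took {elapsed:.2f}s (> {budget_s}s budget)")
    return audio

# Dummy backend standing in for CosyVoice / Edge-TTS / pyttsx3:
audio = timed_tts(lambda t: t.encode("utf-8"), "hello")
```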