ASR-LLM-TTS  by ABexit

Speech interaction system integrating ASR, LLM, and TTS

created 8 months ago
876 stars

Top 41.9% on sourcepulse

GitHubView on GitHub
Project Summary

This project provides a sequential speech interaction system integrating Automatic Speech Recognition (ASR), Large Language Models (LLM), and Text-to-Speech (TTS). It targets developers and researchers building voice-enabled applications, offering a modular framework with multiple model options for flexibility.

How It Works

The system chains together distinct open-source models: SenseVoice for ASR, Qwen2.5 variants for LLM, and CosyVoice, Edge-TTS, or pyttsx3 for TTS. This modular approach allows users to select and swap components based on performance, resource, and quality requirements. Recent updates include voiceprint recognition using CAM++, custom wake-word detection via pinyin matching, and dialogue history memory.

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n chatAudio python=3.10), activate it (conda activate chatAudio), and install PyTorch with CUDA support (e.g., pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118). Install other dependencies via pip install -r requirements.txt or specific packages as listed in the README.
  • Prerequisites: Python 3.10+, CUDA 11.8+ (for the specified PyTorch version), ffmpeg. Manual model downloads may be required if automatic download fails (e.g., due to network restrictions).
  • Execution: Run example scripts like python 13_SenceVoice_QWen2.5_edgeTTS_realTime.py.
  • Resources: CosyVoice dependencies like pynini may require conda install -c conda-forge pynini=2.1.6.
  • Links: Demo videos available on Bilibili.

Highlighted Details

  • Integrates SenseVoice (ASR), Qwen2.5 (LLM), and CosyVoice/Edge-TTS/pyttsx3 (TTS).
  • Supports real-time, interruptible voice interaction with VAD.
  • Offers multimodal interaction with Qwen2-VL-2B for image/video input.
  • Adds voiceprint recognition (CAM++) and custom wake-word functionality.
  • Implements dialogue history memory.

Maintenance & Community

The project is actively updated, with recent additions in late 2024. Links to Bilibili demos are provided for visual examples.

Licensing & Compatibility

The README does not explicitly state the license for the project itself or its constituent models. Users should verify licensing for each component (SenseVoice, Qwen2.5, CosyVoice, Edge-TTS, pyttsx3) before commercial use.

Limitations & Caveats

CosyVoice is noted as having slow inference speeds, impacting real-time performance. Some dependencies, like pynini, may require specific installation methods (e.g., using conda). The project's overall licensing status requires clarification for commercial applications.

Health Check
Last commit

5 months ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
0
Star History
177 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.