ASR-LLM-TTS by ABexit

Speech interaction system integrating ASR, LLM, and TTS

Created 1 year ago
1,085 stars

Top 35.0% on SourcePulse

Project Summary

This project provides a sequential speech interaction system integrating Automatic Speech Recognition (ASR), Large Language Models (LLM), and Text-to-Speech (TTS). It targets developers and researchers building voice-enabled applications, offering a modular framework with multiple model options for flexibility.

How It Works

The system chains together distinct open-source models: SenseVoice for ASR, Qwen2.5 variants for LLM, and CosyVoice, Edge-TTS, or pyttsx3 for TTS. This modular approach allows users to select and swap components based on performance, resource, and quality requirements. Recent updates include voiceprint recognition using CAM++, custom wake-word detection via pinyin matching, and dialogue history memory.
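A minimal sketch of this chaining pattern is shown below, assuming SenseVoice loaded through funasr, a small Qwen2.5 chat model through transformers, and Edge-TTS for synthesis. The model IDs, voice name, and file paths are illustrative and not necessarily what the project's scripts use.

    # Illustrative ASR -> LLM -> TTS chain with dialogue history (not the
    # project's actual code). Assumes funasr, transformers, and edge-tts.
    import asyncio

    import edge_tts
    from funasr import AutoModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # ASR: SenseVoice small model (raw output may carry language/emotion tags
    # that need postprocessing before display).
    asr = AutoModel(model="iic/SenseVoiceSmall")

    # LLM: a small Qwen2.5 instruct model as a stand-in for the Qwen2.5 variants.
    llm_id = "Qwen/Qwen2.5-1.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(llm_id)
    llm = AutoModelForCausalLM.from_pretrained(llm_id, torch_dtype="auto")

    history = [{"role": "system", "content": "You are a helpful voice assistant."}]

    def transcribe(wav_path: str) -> str:
        # funasr returns a list of result dicts; "text" holds the transcript.
        return asr.generate(input=wav_path)[0]["text"]

    def chat(user_text: str) -> str:
        history.append({"role": "user", "content": user_text})
        prompt = tokenizer.apply_chat_template(
            history, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
        output = llm.generate(**inputs, max_new_tokens=256)
        reply = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        history.append({"role": "assistant", "content": reply})  # dialogue memory
        return reply

    async def speak(text: str, out_path: str = "reply.mp3") -> None:
        # Edge-TTS renders the reply to an audio file.
        await edge_tts.Communicate(text, voice="zh-CN-XiaoxiaoNeural").save(out_path)

    if __name__ == "__main__":
        user_text = transcribe("query.wav")  # 1. speech -> text
        reply = chat(user_text)              # 2. text -> reply (history kept)
        asyncio.run(speak(reply))            # 3. reply -> synthesized speech

Swapping any stage (for example, CosyVoice or pyttsx3 in place of Edge-TTS) only changes the corresponding function, which is the point of the modular design.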

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n chatAudio python=3.10), activate it (conda activate chatAudio), and install PyTorch with CUDA support (e.g., pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118). Install other dependencies via pip install -r requirements.txt or specific packages as listed in the README.
  • Prerequisites: Python 3.10+, CUDA 11.8+ (for the specified PyTorch version), and ffmpeg; a quick sanity check for these is sketched after this list. Manual model downloads may be required if automatic download fails (e.g., due to network restrictions).
  • Execution: Run example scripts like python 13_SenceVoice_QWen2.5_edgeTTS_realTime.py.
  • Resources: CosyVoice dependencies like pynini may require conda install -c conda-forge pynini=2.1.6.
  • Links: Demo videos available on Bilibili.
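The prerequisite check referenced above can be done with a short snippet; it is illustrative and not part of the project.

    # Illustrative environment check: confirms the pinned PyTorch build sees
    # CUDA and that ffmpeg is available on PATH before running the demos.
    import shutil

    import torch

    print("torch version:", torch.__version__)           # expected 2.3.1+cu118
    print("CUDA available:", torch.cuda.is_available())  # False => CPU-only build
    print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)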

Highlighted Details

  • Integrates SenseVoice (ASR), Qwen2.5 (LLM), and CosyVoice/Edge-TTS/pyttsx3 (TTS).
  • Supports real-time, interruptible voice interaction with VAD.
  • Offers multimodal interaction with Qwen2-VL-2B for image/video input.
  • Adds voiceprint recognition (CAM++) and custom wake-word detection via pinyin matching (see the sketch after this list).
  • Implements dialogue history memory.
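The pinyin-based wake-word matching referenced above can be illustrated with the pypinyin package: comparing pinyin rather than raw characters lets homophone ASR errors still trigger the wake word. The wake word, test phrases, and helper name below are hypothetical, not the project's configuration.

    # Hypothetical wake-word check via pinyin matching (assumes pypinyin).
    from pypinyin import lazy_pinyin

    WAKE_WORD = "你好小千"  # hypothetical wake word
    WAKE_PINYIN = "".join(lazy_pinyin(WAKE_WORD))  # "nihaoxiaoqian"

    def is_wake_word(transcript: str) -> bool:
        # True if the transcript's pinyin contains the wake word's pinyin.
        return WAKE_PINYIN in "".join(lazy_pinyin(transcript))

    print(is_wake_word("你好小倩，今天天气怎么样"))  # True: homophone still matches
    print(is_wake_word("现在几点了"))                # False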

Maintenance & Community

The project saw active development through late 2024, when the voiceprint, wake-word, and dialogue-memory features were added. Links to Bilibili demos are provided for visual examples.

Licensing & Compatibility

The README does not explicitly state the license for the project itself or its constituent models. Users should verify licensing for each component (SenseVoice, Qwen2.5, CosyVoice, Edge-TTS, pyttsx3) before commercial use.

Limitations & Caveats

CosyVoice is noted as having slow inference speeds, impacting real-time performance. Some dependencies, like pynini, may require specific installation methods (e.g., using conda). The project's overall licensing status requires clarification for commercial applications.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 32 stars in the last 30 days
