Digital human for real-time voice interaction
This project provides a real-time voice-interactive digital human system, targeting developers and researchers interested in creating AI-powered avatars. It offers two main approaches: a cascaded ASR-LLM-TTS-THG pipeline (speech recognition, language model, text-to-speech, talking head generation) and an end-to-end MLLM-THG solution in which a multimodal LLM drives the talking head directly. Both enable customizable appearances and voices with low first-response latency.
How It Works
The system uses a cascaded architecture with FunASR for speech recognition, Qwen for language modeling, GPT-SoVITS/CosyVoice/edge-tts for text-to-speech, and MuseTalk for talking head generation. Alternatively, it supports an end-to-end approach in which GLM-4-Voice replaces the separate ASR, LLM, and TTS stages. This modular design allows flexibility in choosing components and deployment strategies, including local inference or API-based services.
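To make the cascade concrete, here is a minimal structural sketch of one conversational turn. The interfaces and method names below are illustrative assumptions, not the project's actual APIs:

```python
# A minimal structural sketch of one turn in the cascaded pipeline.
# The Protocol interfaces and method names are assumptions for
# illustration, not the project's actual APIs.
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...   # e.g. FunASR


class LLM(Protocol):
    def reply(self, prompt: str) -> str: ...         # e.g. Qwen (local or API)


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...    # e.g. GPT-SoVITS / CosyVoice / edge-tts


class THG(Protocol):
    def animate(self, speech: bytes) -> bytes: ...   # e.g. MuseTalk (returns video)


def respond(audio: bytes, asr: ASR, llm: LLM, tts: TTS, thg: THG) -> bytes:
    """One turn of the cascade: user audio in, talking-head video out."""
    text = asr.transcribe(audio)
    answer = llm.reply(text)
    speech = tts.synthesize(answer)
    return thg.animate(speech)
```

Because each stage sits behind its own interface, any component can be swapped (e.g., CosyVoice for GPT-SoVITS) or delegated to an API-based service without touching the rest of the chain.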
Quick Start & Requirements
The project is distributed via ModelScope. Running the end-to-end pipeline requires a GPU with roughly 20 GB of VRAM; the cascaded pipeline can reduce local hardware requirements by delegating the LLM and TTS stages to API-based services. Model weights are fetched via Git LFS.
Highlighted Details
Key differentiators include the two interchangeable pipelines (cascaded ASR-LLM-TTS-THG and end-to-end MLLM-THG), customizable avatar appearance and voice, TTS voice cloning, a choice of TTS backends (GPT-SoVITS, CosyVoice, edge-tts), and support for both local inference and API-based deployment.
Maintenance & Community
The project is hosted on ModelScope and appears to be actively developed, with several features already implemented (TTS voice cloning, edge-tts integration, local Qwen inference). Planned work includes vLLM acceleration for GLM-4-Voice and Gradio-webrtc integration.
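The edge-tts backend mentioned above can also be exercised on its own. A minimal sketch, assuming the `edge-tts` Python package (the voice name is an arbitrary example, not the project's default):

```python
import asyncio

import edge_tts  # pip install edge-tts


async def synthesize(text: str, outfile: str) -> None:
    # Voice name is an arbitrary example; enumerate options with
    # `await edge_tts.list_voices()`.
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    await communicate.save(outfile)


asyncio.run(synthesize("Hello, I am a digital human.", "reply.mp3"))
```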
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use and integration with closed-source projects.
Limitations & Caveats
The end-to-end MLLM-THG solution requires significantly more VRAM (~20 GB) than the cascaded pipeline. The stability of Gradio video streaming is noted as needing further optimization. Users may need to manually download specific model weights if Git LFS encounters issues.
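If Git LFS fails, weights hosted on ModelScope can usually be fetched programmatically instead. A minimal sketch using the ModelScope SDK's `snapshot_download`; the model ID below is a placeholder, not one of the project's actual checkpoints:

```python
from modelscope import snapshot_download  # pip install modelscope

# Placeholder model ID for illustration only; substitute the checkpoint the
# project's documentation points to (e.g. MuseTalk or GLM-4-Voice weights).
local_dir = snapshot_download("organization/model-name", cache_dir="./weights")
print(f"Model weights saved under: {local_dir}")
```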