VideoChat by Henry-23

Digital human for real-time voice interaction

created 9 months ago
1,033 stars

Top 36.9% on sourcepulse

View on GitHub

Project Summary

This project provides a real-time voice-interactive digital human system, targeting developers and researchers interested in creating AI-powered avatars. It offers two main approaches: a cascaded ASR-LLM-TTS-THG pipeline and an end-to-end MLLM-THG solution, enabling customizable appearances and voices with low initial latency.

How It Works

The system leverages a cascaded architecture using FunASR for speech recognition, Qwen for language modeling, GPT-SoVITS/CosyVoice/edge-tts for text-to-speech, and MuseTalk for talking head generation. Alternatively, it supports an end-to-end approach with GLM-4-Voice for a more integrated experience. This modular design allows flexibility in choosing components and deployment strategies, including local inference or API-based services.
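
For concreteness, here is a minimal sketch of how such a cascaded pipeline can be wired together. The wrapper objects and method names (transcribe, generate, synthesize, animate) are illustrative assumptions, not the repository's actual interfaces:

```python
# Illustrative sketch of the cascaded ASR -> LLM -> TTS -> THG flow;
# the class and method names are assumptions, not the project's real API.

class CascadedDigitalHuman:
    def __init__(self, asr, llm, tts, thg):
        self.asr = asr  # e.g. FunASR: user audio -> text
        self.llm = llm  # e.g. Qwen: text -> reply text
        self.tts = tts  # e.g. GPT-SoVITS / CosyVoice / edge-tts: text -> audio
        self.thg = thg  # e.g. MuseTalk: audio -> lip-synced video frames

    def respond(self, user_audio: bytes):
        text = self.asr.transcribe(user_audio)   # speech recognition
        reply = self.llm.generate(text)          # language-model response
        speech = self.tts.synthesize(reply)      # speech synthesis
        return self.thg.animate(speech)          # talking-head generation
```

In practice the stages are pipelined (e.g. each sentence is synthesized and animated as the LLM streams it), which is presumably how the cascaded solution keeps its first-packet latency low.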

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment, and install requirements.
  • Prerequisites: Ubuntu 22.04, Python 3.10, CUDA 12.2, PyTorch 2.3.0.
  • Weights: Download necessary model weights via Git LFS or manually from specified links.
  • API Keys: Configure DashScope API keys for Qwen and CosyVoice if not running them locally (see the configuration sketch after this list).
  • Resources: ~8 GB VRAM for the cascaded pipeline; ~20 GB VRAM for the end-to-end solution.
  • Demo: https://www.modelscope.cn/studios/AI-ModelScope/video_chat
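
If Qwen and CosyVoice are used through Alibaba Cloud's API rather than run locally, a DashScope key has to be configured. A minimal sketch, assuming the standard dashscope Python SDK; where exactly this project expects the key may differ:

```python
import os

import dashscope  # pip install dashscope

# The SDK reads the key from this attribute or from the
# DASHSCOPE_API_KEY environment variable; substitute your own key.
dashscope.api_key = os.environ.get("DASHSCOPE_API_KEY", "sk-your-key-here")
```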

Highlighted Details

  • Supports voice cloning for custom TTS voices.
  • First-packet latency as low as ~3 seconds for the cascaded solution.
  • Offers local inference options for the LLM (Qwen) and TTS (GPT-SoVITS) modules; edge-tts serves as a lighter TTS alternative (see the sketch after this list).
  • Allows customization of digital human appearance and voice.
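
Of the three TTS backends, edge-tts is the only one that needs neither a GPU nor an API key, which makes it convenient for a first smoke test. A standalone sketch using the public edge-tts package, independent of this repository's wrapper code:

```python
import asyncio

import edge_tts  # pip install edge-tts

async def synthesize(text: str, out_path: str = "reply.mp3") -> None:
    # edge-tts offers preset voices only; cloning a custom voice
    # requires the GPT-SoVITS backend instead.
    communicate = edge_tts.Communicate(text, voice="zh-CN-XiaoxiaoNeural")
    await communicate.save(out_path)

asyncio.run(synthesize("Hello, nice to meet you."))
```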

Maintenance & Community

The demo is hosted on ModelScope, and the project appears to be actively developed, with several features already implemented (TTS voice cloning, edge-tts integration, local Qwen inference). Planned work includes vLLM acceleration for GLM-4-Voice and Gradio-webrtc integration.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use and integration with closed-source projects.

Limitations & Caveats

The end-to-end MLLM-THG solution requires significantly more VRAM (~20 GB, versus ~8 GB for the cascaded pipeline). Gradio video streaming is noted as not yet fully stable and in need of further optimization. Users may need to download specific model weights manually if Git LFS encounters issues.
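
If Git LFS fails, weights can also be fetched programmatically. A sketch assuming the weights are mirrored on ModelScope and that the modelscope SDK's snapshot_download applies; the model ID below is a placeholder for illustration, not the project's actual weight repository:

```python
from modelscope import snapshot_download  # pip install modelscope

# Placeholder model ID -- substitute the weight repositories the README lists.
model_dir = snapshot_download("namespace/model-weights", cache_dir="./weights")
print(f"Weights downloaded to {model_dir}")
```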

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 138 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

MiniCPM-o by OpenBMB

20k stars
Top 0.2% on sourcepulse
MLLM for vision, speech, and multimodal live streaming on your phone
created 1 year ago, updated 1 month ago