VideoChat by Henry-23

Digital human for real-time voice interaction

created 9 months ago
1,033 stars

Top 36.9% on sourcepulse

View on GitHub

Project Summary

This project provides a real-time voice-interactive digital human system, targeting developers and researchers interested in creating AI-powered avatars. It offers two main approaches: a cascaded ASR-LLM-TTS-THG pipeline and an end-to-end MLLM-THG solution, enabling customizable appearances and voices with low initial latency.

How It Works

The system leverages a cascaded architecture using FunASR for speech recognition, Qwen for language modeling, GPT-SoVITS/CosyVoice/edge-tts for text-to-speech, and MuseTalk for talking head generation. Alternatively, it supports an end-to-end approach with GLM-4-Voice for a more integrated experience. This modular design allows flexibility in choosing components and deployment strategies, including local inference or API-based services.
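
For concreteness, here is a minimal sketch of how such a cascaded pipeline can be wired together. The wrapper objects and method names (transcribe, generate, synthesize, animate) are illustrative assumptions, not the repository's actual interfaces:

```python
# Illustrative sketch of the cascaded ASR -> LLM -> TTS -> THG flow;
# the class and method names are assumptions, not the project's real API.

class CascadedDigitalHuman:
    def __init__(self, asr, llm, tts, thg):
        self.asr = asr  # e.g. FunASR: user audio -> text
        self.llm = llm  # e.g. Qwen: text -> reply text
        self.tts = tts  # e.g. GPT-SoVITS / CosyVoice / edge-tts: text -> audio
        self.thg = thg  # e.g. MuseTalk: audio -> lip-synced video frames

    def respond(self, user_audio: bytes):
        text = self.asr.transcribe(user_audio)   # speech recognition
        reply = self.llm.generate(text)          # language-model response
        speech = self.tts.synthesize(reply)      # speech synthesis
        return self.thg.animate(speech)          # talking-head generation
```

In practice the stages are pipelined (e.g. each sentence is synthesized and animated as the LLM streams it), which is presumably how the cascaded solution keeps its first-packet latency low.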

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment, and install requirements.
  • Prerequisites: Ubuntu 22.04, Python 3.10, CUDA 12.2, PyTorch 2.3.0.
  • Weights: Download necessary model weights via Git LFS or manually from specified links.
  • API Keys: Configure DashScope API keys for Qwen and CosyVoice if not running them locally (see the configuration sketch after this list).
  • Resources: ~8 GB VRAM for the cascaded pipeline; ~20 GB VRAM for the end-to-end solution.
  • Demo: https://www.modelscope.cn/studios/AI-ModelScope/video_chat
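
If Qwen and CosyVoice are used through Alibaba Cloud's API rather than run locally, a DashScope key has to be configured. A minimal sketch, assuming the standard dashscope Python SDK; where exactly this project expects the key may differ:

```python
import os

import dashscope  # pip install dashscope

# The SDK reads the key from this attribute or from the
# DASHSCOPE_API_KEY environment variable; substitute your own key.
dashscope.api_key = os.environ.get("DASHSCOPE_API_KEY", "sk-your-key-here")
```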

Highlighted Details

  • Supports voice cloning for custom TTS voices.
  • First-packet latency as low as ~3 seconds for the cascaded solution.
  • Offers local inference options for the LLM (Qwen) and TTS (GPT-SoVITS) modules; edge-tts serves as a lighter TTS alternative (see the sketch after this list).
  • Allows customization of digital human appearance and voice.
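
Of the three TTS backends, edge-tts is the only one that needs neither a GPU nor an API key, which makes it convenient for a first smoke test. A standalone sketch using the public edge-tts package, independent of this repository's wrapper code:

```python
import asyncio

import edge_tts  # pip install edge-tts

async def synthesize(text: str, out_path: str = "reply.mp3") -> None:
    # edge-tts offers preset voices only; cloning a custom voice
    # requires the GPT-SoVITS backend instead.
    communicate = edge_tts.Communicate(text, voice="zh-CN-XiaoxiaoNeural")
    await communicate.save(out_path)

asyncio.run(synthesize("Hello, nice to meet you."))
```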

Maintenance & Community

The demo is hosted on ModelScope, and the project appears to be actively developed, with several features already implemented (TTS voice cloning, edge-tts integration, local Qwen inference). Planned work includes vLLM acceleration for GLM-4-Voice and Gradio-webrtc integration.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use and integration with closed-source projects.

Limitations & Caveats

The end-to-end MLLM-THG solution requires significantly more VRAM (~20 GB, versus ~8 GB for the cascaded pipeline). Gradio video streaming is noted as not yet fully stable and in need of further optimization. Users may need to download specific model weights manually if Git LFS encounters issues.
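
If Git LFS fails, weights can also be fetched programmatically. A sketch assuming the weights are mirrored on ModelScope and that the modelscope SDK's snapshot_download applies; the model ID below is a placeholder for illustration, not the project's actual weight repository:

```python
from modelscope import snapshot_download  # pip install modelscope

# Placeholder model ID -- substitute the weight repositories the README lists.
model_dir = snapshot_download("namespace/model-weights", cache_dir="./weights")
print(f"Weights downloaded to {model_dir}")
```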

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 138 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

MiniCPM-o by OpenBMB

20k stars
Top 0.2% on sourcepulse
MLLM for vision, speech, and multimodal live streaming on your phone
created 1 year ago, updated 1 month ago