This project provides a modular, interactive digital human conversation system designed to run on a single PC, targeting developers and researchers in AI and virtual reality. It offers low-latency, multimodal conversations with customizable components, enabling flexible integration of various AI models for speech, language, and avatar rendering.
How It Works
The system employs a modular architecture, allowing users to swap components for Automatic Speech Recognition (ASR), Large Language Models (LLM), Text-to-Speech (TTS), and avatar rendering. It supports both a fully local mode using models like MiniCPM-o and a hybrid mode leveraging cloud APIs for LLM and TTS. This flexibility reduces system requirements and allows for diverse conversational experiences.
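The swappable-component idea can be sketched as a small pipeline of abstract handlers. This is a minimal illustration only: the class and method names (`ASRHandler`, `ConversationPipeline`, etc.) are hypothetical and do not reflect the project's actual API.

```python
from abc import ABC, abstractmethod

class ASRHandler(ABC):
    """Speech-to-text component (swappable)."""
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class LLMHandler(ABC):
    """Language-model component (local model or cloud API)."""
    @abstractmethod
    def chat(self, prompt: str) -> str: ...

class TTSHandler(ABC):
    """Text-to-speech component (swappable)."""
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

class ConversationPipeline:
    """Chains ASR -> LLM -> TTS; any stage can be replaced independently."""
    def __init__(self, asr: ASRHandler, llm: LLMHandler, tts: TTSHandler):
        self.asr, self.llm, self.tts = asr, llm, tts

    def respond(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)   # speech -> text
        reply = self.llm.chat(text)         # text -> reply
        return self.tts.synthesize(reply)   # reply -> speech

# Trivial stub implementations, just to show the wiring:
class EchoASR(ASRHandler):
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()

class EchoLLM(LLMHandler):
    def chat(self, prompt: str) -> str:
        return f"You said: {prompt}"

class EchoTTS(TTSHandler):
    def synthesize(self, text: str) -> bytes:
        return text.encode()

pipeline = ConversationPipeline(EchoASR(), EchoLLM(), EchoTTS())
print(pipeline.respond(b"hello"))  # b'You said: hello'
```

Swapping the local `EchoLLM` stub for a cloud-backed handler (hybrid mode) requires no change to the pipeline itself, which is the flexibility the architecture is built around.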
Quick Start & Requirements
- Installation: `uv` is recommended for environment management. Install dependencies via `uv sync --all-packages` or mode-specific installs. Run via `uv run src/demo.py --config <config_file.yaml>`. Docker execution is also supported via `./build_and_run.sh --config <config_file.yaml>`.
- Prerequisites: Python >=3.10, <3.12. CUDA-enabled GPU with NVIDIA driver supporting CUDA >= 12.4. Unquantized MiniCPM-o requires >20GB VRAM; int4 quantized version reduces VRAM needs. Git LFS is required for submodules.
- Resources: Local MiniCPM-o inference achieves an average response latency of ~2.2s on an i9-13900KF with an RTX 4090. CPU inference can reach up to 30 FPS.
- Links: Demo, LiteAvatarGallery, LAM
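Runtime behavior is selected through the YAML config passed on the command line. The fragment below is illustrative only: the field and module names are hypothetical and do not reflect the project's actual configuration schema.

```yaml
# Hypothetical config sketch: keys shown here are assumptions, not the real schema.
chat_engine:
  handlers:
    asr:
      module: local_asr        # local speech recognition
    llm:
      module: minicpm_o        # fully local multimodal model
      # module: cloud_llm_api  # hybrid mode: cloud LLM instead
    tts:
      module: cosyvoice_tts    # swap for a cloud TTS API in hybrid mode
  avatar:
    module: lite_avatar        # 2D avatar; a LAM module would select 3D rendering
```

Shipping several such pre-set configs is how the project offers different model combinations without code changes.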
Highlighted Details
- Low-latency (avg. 2.2s) real-time digital human conversation.
- Supports multimodal LLMs (text, audio, video).
- Modular design for flexible component replacement.
- Integrates LiteAvatar for 2D avatars and LAM for ultra-realistic 3D digital humans.
- Offers multiple pre-set configurations for different model combinations.
Maintenance & Community
- Active development with recent releases (v0.3.0 on 2025.04.18).
- Community contributions acknowledged, with links to deployment tutorials.
- Project is actively maintained by HumanAIGC-Engineering.
Licensing & Compatibility
- The repository itself appears to be under a permissive license, but specific component licenses (e.g., for models like MiniCPM-o, CosyVoice) should be reviewed individually for commercial use restrictions.
Limitations & Caveats
- CosyVoice local TTS on Windows requires a specific Conda-based installation workaround due to `pynini` compilation issues.
- Using video input with MiniCPM-o can significantly increase VRAM consumption, potentially leading to OOM errors on lower-spec GPUs.
- LAM avatar generation pipeline is noted as "not ready yet."