AvatarForcing by TaekyungKi

Real-time interactive head avatar generation for natural conversation

Created 8 months ago

333 stars

Top 82.0% on SourcePulse

1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

This project addresses the limitations of current talking head generation models, which often fail to convey true interactivity and emotional engagement. It introduces Avatar Forcing, a framework designed for real-time, interactive head avatar generation that enables avatars to process multimodal inputs and react instantly to user cues. The target audience includes researchers and developers in virtual communication and content creation, offering a path towards more human-like conversational avatars.

How It Works

Avatar Forcing models real-time user-avatar interactions using diffusion forcing, allowing for low-latency processing of multimodal inputs like user audio and motion. This enables immediate reactions to speech, nods, and laughter. The framework also incorporates a novel direct preference optimization method that leverages synthetic losing samples, constructed by dropping user conditions, to learn expressive interaction without requiring labeled data.

Quick Start & Requirements

Installation: Create a Conda environment (conda create -n avatarforcing python==3.10), activate it (conda activate avatarforcing), install PyTorch 2.0.1 built for CUDA 11.8, and then install the project requirements (pip install -r requirements.txt).
Prerequisites: Python 3.10, PyTorch 2.0.1, and CUDA 11.8. The model checkpoints (main DFoT and motion AE) must be downloaded from Google Drive and the Wav2Vec2 model from HuggingFace, then organized in a ./pretrained_dir folder.
Preprocessing: Relies on external tools like IIANet for target speaker extraction and ClearVoice for speaker separation. A script preprocess_user_video.py is provided for video frame and facial region extraction.
Inference: A minimal PyTorch inference pipeline is supported via inference.py.
Links: Model weights: Google Drive, Wav2Vec2: HuggingFace, ClearVoice: GitHub.
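
Before running inference, it can help to confirm that the environment matches the versions listed above and that the checkpoints are in place. The following is a minimal sketch of such a check, not a script shipped with the repository: the function names are hypothetical, and because this summary does not list the checkpoint file names, they are passed in as command-line arguments rather than assumed.

```python
import sys
from pathlib import Path

REQUIRED_PYTHON = (3, 10)
REQUIRED_TORCH = "2.0.1"
REQUIRED_CUDA = "11.8"


def check_versions(python_version, torch_version, cuda_version):
    """Return a list of mismatches against the versions stated in the listing."""
    problems = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        found = f"{python_version[0]}.{python_version[1]}"
        problems.append(f"Python {found} found, 3.10 expected")
    # PyTorch wheels report versions such as "2.0.1+cu118"; compare the part
    # before the "+" suffix.
    if torch_version is None or torch_version.split("+")[0] != REQUIRED_TORCH:
        problems.append(f"PyTorch {torch_version} found, {REQUIRED_TORCH} expected")
    if cuda_version != REQUIRED_CUDA:
        problems.append(f"CUDA {cuda_version} found, {REQUIRED_CUDA} expected")
    return problems


def missing_checkpoints(pretrained_dir, expected_names):
    """Return the expected file names that are absent from pretrained_dir."""
    root = Path(pretrained_dir)
    return [name for name in expected_names if not (root / name).exists()]


if __name__ == "__main__":
    try:
        import torch
        torch_version, cuda_version = torch.__version__, torch.version.cuda
    except ImportError:
        torch_version, cuda_version = None, None
    for problem in check_versions(sys.version_info, torch_version, cuda_version):
        print("WARNING:", problem)
    # Checkpoint file names are supplied by the caller, e.g.
    #   python check_env.py <file1> <file2> ...
    for name in missing_checkpoints("./pretrained_dir", sys.argv[1:]):
        print("MISSING:", name)
```

Keeping the version comparison in a plain function that takes strings makes it usable even on a machine where PyTorch is not yet installed.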

Highlighted Details

Achieves real-time interaction with a latency of approximately 500 ms.
Offers a 6.8x speedup compared to baseline methods.
Generates reactive and expressive avatar motion, which users preferred over baselines more than 80% of the time.

Maintenance & Community

The project is associated with CVPR 2026 and lists authors from KAIST, NTU Singapore, and DeepAuto.ai. No specific community channels (e.g., Discord, Slack), roadmap, or active maintenance signals beyond the publication are detailed in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Given its publication in a major computer vision conference, it is likely intended for research purposes, and commercial use may be restricted. Compatibility with closed-source applications is not specified.

Limitations & Caveats

This repository provides only a minimal PyTorch inference pipeline and does not include real-time conversational demo applications or integrations with services like GPT Voice API. Building a full real-time conversational avatar system is possible but falls outside the scope of this repository. The quality of generated avatars is highly dependent on the performance of external audio and video preprocessing tools.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Star History

12 stars in the last 30 days