Talking video generation research paper
Top 36.6% on sourcepulse
MEMO is a diffusion model for generating expressive talking videos from a single image and an audio clip. It targets researchers and developers in AI video generation, offering a method to create realistic lip-synced and emotionally resonant video content.
How It Works
MEMO employs a memory-guided diffusion approach. It conditions the diffusion process on pre-processed audio, face, and emotion embeddings, allowing the model to capture and translate subtle nuances in speech and emotion into corresponding facial movements and expressions in the generated video, aiming for higher expressiveness and temporal coherence.
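As a rough illustration of this conditioning scheme, a single denoising step might look like the sketch below. This is a minimal Python sketch under assumed names: denoise_frame, MemoryGuidedUNet-style call signature, and past_frame_memory are hypothetical and do not reflect MEMO's actual code.

import torch

# Hypothetical sketch: one reverse-diffusion step conditioned on audio,
# identity (face), and emotion embeddings plus a memory of past frames.
# All names and shapes here are illustrative, not MEMO's real API.
def denoise_frame(unet, noisy_latent, timestep,
                  audio_emb, face_emb, emotion_emb, past_frame_memory):
    # Stack the conditioning signals into one context sequence of shape
    # (batch, tokens, dim) that the UNet attends to via cross-attention.
    context = torch.cat([audio_emb, face_emb, emotion_emb, past_frame_memory], dim=1)
    # Predict the noise residual for this frame given the combined context.
    return unet(noisy_latent, timestep, encoder_hidden_states=context)

Under this reading, the memory term would carry information from previously generated frames, which is the role the memory-guided design targets for keeping identity and motion coherent across the video.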
Quick Start & Requirements
Set up a conda environment, install ffmpeg, and install the package in editable mode:

conda create -n memo python=3.10
conda activate memo
conda install -c conda-forge ffmpeg
pip install -e .

Then run inference on a single image and audio clip:

python inference.py --config configs/inference.yaml --input_image <IMAGE_PATH> --input_audio <AUDIO_PATH> --output_dir <SAVE_PATH>
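For running several image/audio pairs in a row, a small wrapper around the single-run command above can help. This is a hypothetical convenience sketch: the file paths and the outputs/ layout are examples, and only the flags shown above are assumed.

import subprocess
from pathlib import Path

# Example (image, audio) pairs; replace with real paths.
pairs = [
    ("assets/person1.png", "assets/clip1.wav"),
    ("assets/person2.png", "assets/clip2.wav"),
]

for image, audio in pairs:
    # One output directory per input image, created if missing.
    out_dir = Path("outputs") / Path(image).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # Invoke the inference script exactly as in the quick start above.
    subprocess.run(
        [
            "python", "inference.py",
            "--config", "configs/inference.yaml",
            "--input_image", image,
            "--input_audio", audio,
            "--output_dir", str(out_dir),
        ],
        check=True,
    )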
Highlighted Details
Maintenance & Community
The project is associated with authors from multiple institutions. Community contributions are welcomed.
Licensing & Compatibility
The repository is released under an unspecified license. The README states the preview model is for "research purposes" and warns against misuse for malicious content. Users must ensure compliance with legal regulations and ethical standards, and unauthorized use of third-party intellectual property is forbidden.
Limitations & Caveats
The project explicitly states it has only open-sourced a "preview model for research purposes," implying potential limitations or incompleteness compared to a production-ready system. Users are responsible for ethical and legal compliance regarding input and output content.