memo by memoavatar

Talking video generation research paper

created 1 year ago
1,047 stars

Top 36.6% on sourcepulse

Project Summary

MEMO is a diffusion model for generating expressive talking videos from a single image and an audio clip. It targets researchers and developers in AI video generation, offering a method to create realistic lip-synced and emotionally resonant video content.

How It Works

MEMO employs a memory-guided diffusion approach. It leverages pre-processed embeddings for audio, face, and emotions to condition the diffusion process. This allows the model to capture and translate subtle nuances in speech and emotion into corresponding facial movements and expressions in the generated video, aiming for higher expressiveness and coherence.
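
The sketch below is a minimal illustration of this idea and is not the authors' implementation: the module and tensor names (ConditionedDenoiser, memory, the embedding shapes) are hypothetical, and plain PyTorch cross-attention plus a rolling memory buffer stand in for MEMO's actual architecture. It only shows, schematically, how precomputed audio, emotion, and face embeddings together with features from previously generated frames could condition each denoising step.

```python
# Illustrative sketch only -- not the MEMO codebase. Hypothetical module and
# tensor names; a toy scheduler update replaces a real diffusion sampler.
import torch
import torch.nn as nn


class ConditionedDenoiser(nn.Module):
    """Toy denoiser: noisy frame latents attend to conditioning tokens."""

    def __init__(self, latent_dim=64, cond_dim=64, n_heads=4):
        super().__init__()
        # Cross-attention from frame latents to audio/emotion/face/memory tokens.
        self.cross_attn = nn.MultiheadAttention(latent_dim, n_heads,
                                                kdim=cond_dim, vdim=cond_dim,
                                                batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, cond_tokens):
        # noisy_latents: (batch, n_patches, latent_dim)
        # cond_tokens:   (batch, n_cond, cond_dim) -- audio + emotion + face + memory
        attended, _ = self.cross_attn(noisy_latents, cond_tokens, cond_tokens)
        return self.out(noisy_latents + attended)


def denoise_frame(denoiser, audio_emb, emotion_emb, face_emb, memory, steps=20):
    """Schematic reverse-diffusion loop for a single frame (20 steps, as in the README)."""
    latents = torch.randn(1, 16, 64)  # start from pure noise
    cond = torch.cat([audio_emb, emotion_emb, face_emb, memory], dim=1)
    for _ in range(steps):
        pred = denoiser(latents, cond)
        latents = latents - 0.05 * pred  # stand-in for a real scheduler update
    return latents


if __name__ == "__main__":
    denoiser = ConditionedDenoiser()
    memory = torch.zeros(1, 4, 64)            # rolling memory of past frame features
    for t in range(3):                        # generate a few frames
        audio_emb = torch.randn(1, 2, 64)     # precomputed per-frame audio embedding
        emotion_emb = torch.randn(1, 1, 64)   # precomputed per-frame emotion embedding
        face_emb = torch.randn(1, 1, 64)      # identity embedding from the input image
        frame = denoise_frame(denoiser, audio_emb, emotion_emb, face_emb, memory)
        # Fold a pooled summary of the new frame into the memory so that later
        # frames stay temporally coherent with earlier ones.
        memory = torch.cat([memory[:, 1:], frame.mean(1, keepdim=True)], dim=1)
        print(f"frame {t}: latents {tuple(frame.shape)}")
```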

Quick Start & Requirements

  • Install: pip install -e . within a conda environment (conda create -n memo python=3.10, conda activate memo, conda install -c conda-forge ffmpeg).
  • Prerequisites: CUDA 12, Python 3.10. Checkpoints are downloaded automatically from Hugging Face.
  • Inference: python inference.py --config configs/inference.yaml --input_image <IMAGE_PATH> --input_audio <AUDIO_PATH> --output_dir <SAVE_PATH> (a scripted batch-run sketch follows this list).
  • Resources: Tested on H100 and RTX 4090 GPUs. Inference runs at roughly 1 s/frame on an H100 and 2 s/frame on an RTX 4090 with 20 diffusion steps for 30 fps output, i.e., about 30-60 s of compute per second of generated video.
  • Links: Project Page, arXiv, ComfyUI integration, Gradio app, Demo, Jupyter notebook
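
For scripted or batched runs, a thin wrapper around the documented command line can be enough. The sketch below assumes only the flags shown above; the input paths are placeholders, not files shipped with the repository.

```python
# Minimal batch-run sketch, assuming the documented CLI:
#   python inference.py --config configs/inference.yaml
#          --input_image ... --input_audio ... --output_dir ...
# Input paths below are placeholders.
import subprocess
from pathlib import Path

jobs = [
    ("assets/portrait_01.png", "assets/speech_01.wav"),  # placeholder inputs
    ("assets/portrait_02.png", "assets/speech_02.wav"),
]

output_dir = Path("outputs")
output_dir.mkdir(exist_ok=True)

for image, audio in jobs:
    subprocess.run(
        [
            "python", "inference.py",
            "--config", "configs/inference.yaml",
            "--input_image", image,
            "--input_audio", audio,
            "--output_dir", str(output_dir),
        ],
        check=True,  # stop at the first failed job
    )
```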

Highlighted Details

  • Memory-guided diffusion for expressive talking video generation.
  • Supports finetuning on custom video datasets.
  • Inference speed of ~1-2 seconds per frame on high-end GPUs.
  • Community integrations available (ComfyUI, Gradio).

Maintenance & Community

The project is associated with authors from multiple institutions. Community contributions are welcomed.

Licensing & Compatibility

The repository is released under an unspecified license. The README states the preview model is for "research purposes" and warns against misuse for malicious content. Users must ensure compliance with legal regulations and ethical standards, and unauthorized use of third-party intellectual property is forbidden.

Limitations & Caveats

The project explicitly states it has only open-sourced a "preview model for research purposes," implying potential limitations or incompleteness compared to a production-ready system. Users are responsible for ethical and legal compliance regarding input and output content.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 245 stars in the last 90 days
