PyTorch code for a visual speech processing research paper
VSP-LLM is a PyTorch framework for visual speech processing, enabling tasks like visual speech recognition and translation by integrating Large Language Models (LLMs). It targets researchers and developers working with multimodal AI, offering efficient and context-aware processing of visual speech data.
How It Works
The framework maps input video into the latent space of an LLM using a self-supervised visual speech model (AV-HuBERT). A novel deduplication method based on discrete visual speech units reduces redundancy in the embedded visual features; coupled with Low-Rank Adaptors (LoRA), this enables computationally efficient training.
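For intuition, the deduplication step can be pictured as run-length pooling over the discrete visual speech units, followed by a projection into the LLM embedding space. The sketch below is a minimal illustration only, using hypothetical names and dimensions (deduplicate, VisualToLLMProjector, visual_dim=1024, llm_dim=4096) rather than the repository's actual API; in the real framework the projected embeddings are consumed by a frozen LLM adapted with LoRA, which is omitted here.

```python
# Illustrative sketch (hypothetical names, not the repository's API): collapse runs of
# frames that map to the same discrete visual speech unit, then project the pooled
# features into the LLM's embedding space. The LoRA-adapted LLM itself is not shown.
import torch
import torch.nn as nn


def deduplicate(features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Average consecutive feature frames that share the same visual speech unit.

    features: (T, D) frame-level visual features (e.g., from AV-HuBERT)
    units:    (T,)   discrete visual speech unit assigned to each frame
    returns:  (T', D) with T' <= T, one averaged vector per run of identical units
    """
    pooled, start = [], 0
    for t in range(1, len(units) + 1):
        if t == len(units) or units[t] != units[start]:
            pooled.append(features[start:t].mean(dim=0))
            start = t
    return torch.stack(pooled)


class VisualToLLMProjector(nn.Module):
    """Maps deduplicated visual features into the LLM token-embedding space."""

    def __init__(self, visual_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
        return self.proj(deduplicate(features, units))


# Example: 100 video frames typically collapse to far fewer LLM input embeddings.
feats = torch.randn(100, 1024)                     # stand-in for AV-HuBERT features
units = torch.randint(0, 200, (100,))              # stand-in for clustered speech units
llm_inputs = VisualToLLMProjector()(feats, units)  # shape (T', 4096)
print(llm_inputs.shape)
```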
Quick Start & Requirements
Create a conda environment (conda create -n vsp-llm python=3.9), activate it, install the Python requirements (pip install -r requirements.txt), and install fairseq in editable mode (cd fairseq; pip install --editable ./).
Highlighted Details
Maintenance & Community
The project is associated with the paper "Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing" by Yeo et al. (Findings of the Association for Computational Linguistics: EMNLP 2024). No community links (Discord, Slack) are provided in the README. At the time of writing, the repository was last updated roughly four months ago and is flagged as inactive.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The setup requires significant data preprocessing and downloading large pre-trained models (AV-HuBERT, LLaMA2-7B). The framework is presented as code for a specific research paper, and its general usability or ongoing maintenance status is not detailed.