VSP-LLM by Sally-SH

PyTorch code for visual speech processing research paper

created 1 year ago
327 stars

Top 84.6% on sourcepulse

Project Summary

VSP-LLM is a PyTorch framework for visual speech processing, enabling tasks like visual speech recognition and translation by integrating Large Language Models (LLMs). It targets researchers and developers working with multimodal AI, offering efficient and context-aware processing of visual speech data.

How It Works

The framework maps input video into an LLM's latent space using a self-supervised visual speech model (AV-HuBERT). It employs a novel deduplication method based on visual speech units to reduce redundant embedded features. Coupled with Low-Rank Adaptation (LoRA), this approach enables computationally efficient training.
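The deduplication idea described above can be sketched as collapsing runs of consecutive identical visual speech units and averaging the frame embeddings within each run. This is a minimal illustrative sketch, not the repository's implementation; the function name and plain-list representation are assumptions for clarity.

```python
def deduplicate(units, feats):
    """Collapse runs of identical visual speech units.

    units: list[int] -- cluster IDs, one per video frame
    feats: list[list[float]] -- one embedding per frame
    Returns (deduped_units, deduped_feats), where each run of
    identical units is replaced by a single unit whose embedding
    is the mean of the run's embeddings.
    """
    out_units, out_feats = [], []
    i = 0
    while i < len(units):
        # Find the end of the current run of identical units.
        j = i
        while j < len(units) and units[j] == units[i]:
            j += 1
        run = feats[i:j]
        dim = len(run[0])
        # Average the embeddings across the run, dimension by dimension.
        avg = [sum(vec[k] for vec in run) / len(run) for k in range(dim)]
        out_units.append(units[i])
        out_feats.append(avg)
        i = j
    return out_units, out_feats
```

For example, six frames whose units are `[3, 3, 5, 5, 5, 2]` reduce to three unit tokens, shrinking the sequence the LLM must attend over.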

Quick Start & Requirements

  • Install: Clone repo, create conda env (conda create -n vsp-llm python=3.9), activate, install requirements (pip install -r requirements.txt), and install fairseq (cd fairseq; pip install --editable ./).
  • Prerequisites: AV-HuBERT Large checkpoint, LLaMA2-7B checkpoint, LRS3 dataset preprocessed according to Auto-AVSR and AV-HuBERT guidelines, and generated visual speech unit/cluster count files.
  • Links: Model Checkpoint, AV-HuBERT, LLaMA2-7B.
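The install steps above can be collected into one script. This is a sketch only: the repository URL is inferred from the author name and may differ, and checkpoint/dataset paths are not covered here.

```shell
# Clone the repo (URL assumed from the author name; verify before use)
git clone https://github.com/Sally-SH/VSP-LLM.git
cd VSP-LLM

# Create and activate the conda environment
conda create -n vsp-llm python=3.9 -y
conda activate vsp-llm

# Install Python dependencies
pip install -r requirements.txt

# Install fairseq in editable mode from the bundled source
cd fairseq
pip install --editable ./
```

The AV-HuBERT and LLaMA2-7B checkpoints and the preprocessed LRS3 data must still be obtained separately, as listed in the prerequisites.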

Highlighted Details

  • Integrates LLMs for enhanced context modeling in visual speech tasks.
  • Supports multi-task learning for visual speech recognition and translation via instructions.
  • Features a novel deduplication method using visual speech units.
  • Utilizes LoRA for computationally efficient training.

Maintenance & Community

The project is associated with the paper "Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing" by Yeo et al. (Findings of the Association for Computational Linguistics: EMNLP 2024). No community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup requires significant data preprocessing and downloading large pre-trained models (AV-HuBERT, LLaMA2-7B). The framework is presented as code for a specific research paper, and its general usability or ongoing maintenance status is not detailed.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days
