VSP-LLM by Sally-SH

PyTorch code for visual speech processing research paper

created 1 year ago
327 stars

Top 84.6% on sourcepulse

Project Summary

VSP-LLM is a PyTorch framework for visual speech processing, enabling tasks like visual speech recognition and translation by integrating Large Language Models (LLMs). It targets researchers and developers working with multimodal AI, offering efficient and context-aware processing of visual speech data.

How It Works

The framework maps input video into an LLM's latent space using a self-supervised visual speech model (AV-HuBERT). It employs a novel deduplication method based on visual speech units to reduce redundant embedded features. Coupled with Low-Rank Adaptation (LoRA), this approach enables computationally efficient training.
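The deduplication idea described above can be sketched as collapsing runs of consecutive identical visual speech units and averaging the frame embeddings within each run. This is a minimal illustrative sketch, not the repository's implementation; the function name and plain-list representation are assumptions for clarity.

```python
def deduplicate(units, feats):
    """Collapse runs of identical visual speech units.

    units: list[int] -- cluster IDs, one per video frame
    feats: list[list[float]] -- one embedding per frame
    Returns (deduped_units, deduped_feats), where each run of
    identical units is replaced by a single unit whose embedding
    is the mean of the run's embeddings.
    """
    out_units, out_feats = [], []
    i = 0
    while i < len(units):
        # Find the end of the current run of identical units.
        j = i
        while j < len(units) and units[j] == units[i]:
            j += 1
        run = feats[i:j]
        dim = len(run[0])
        # Average the embeddings across the run, dimension by dimension.
        avg = [sum(vec[k] for vec in run) / len(run) for k in range(dim)]
        out_units.append(units[i])
        out_feats.append(avg)
        i = j
    return out_units, out_feats
```

For example, six frames whose units are `[3, 3, 5, 5, 5, 2]` reduce to three unit tokens, shrinking the sequence the LLM must attend over.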

Quick Start & Requirements

  • Install: Clone repo, create conda env (conda create -n vsp-llm python=3.9), activate, install requirements (pip install -r requirements.txt), and install fairseq (cd fairseq; pip install --editable ./).
  • Prerequisites: AV-HuBERT Large checkpoint, LLaMA2-7B checkpoint, LRS3 dataset preprocessed according to Auto-AVSR and AV-HuBERT guidelines, and generated visual speech unit/cluster count files.
  • Links: Model Checkpoint, AV-HuBERT, LLaMA2-7B.
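The install steps above can be collected into one script. This is a sketch only: the repository URL is inferred from the author name and may differ, and checkpoint/dataset paths are not covered here.

```shell
# Clone the repo (URL assumed from the author name; verify before use)
git clone https://github.com/Sally-SH/VSP-LLM.git
cd VSP-LLM

# Create and activate the conda environment
conda create -n vsp-llm python=3.9 -y
conda activate vsp-llm

# Install Python dependencies
pip install -r requirements.txt

# Install fairseq in editable mode from the bundled source
cd fairseq
pip install --editable ./
```

The AV-HuBERT and LLaMA2-7B checkpoints and the preprocessed LRS3 data must still be obtained separately, as listed in the prerequisites.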

Highlighted Details

  • Integrates LLMs for enhanced context modeling in visual speech tasks.
  • Supports multi-task learning for visual speech recognition and translation via instructions.
  • Features a novel deduplication method using visual speech units.
  • Utilizes LoRA for computationally efficient training.

Maintenance & Community

The project is associated with the paper "Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing" by Yeo et al. (Findings of the Association for Computational Linguistics: EMNLP 2024). No community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup requires significant data preprocessing and downloading large pre-trained models (AV-HuBERT, LLaMA2-7B). The framework is presented as code for a specific research paper, and its general usability or ongoing maintenance status is not detailed.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days
