Emotion-LLaMA addresses the limitations of single-modality emotion recognition and the challenges in multimodal large language models (MLLMs) for integrating audio and visual cues. It enables nuanced emotion recognition and reasoning by processing audio, visual, and textual inputs, targeting researchers and developers in human-computer interaction, education, and counseling.
How It Works
Emotion-LLaMA integrates audio, visual, and textual data using specialized encoders. Features are aligned into a shared space, and a modified LLaMA model, fine-tuned with instructions, handles the multimodal understanding and reasoning. This approach aims to capture complex emotional expressions more effectively than single-modality methods.
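The following is a minimal sketch (not the official implementation) of the kind of encoder-to-LLM alignment described above: pre-extracted audio and visual features are mapped through per-modality linear adapters into the LLaMA embedding space and prepended to the text prompt tokens. All dimensions, class names, and the fusion strategy are illustrative assumptions.

```python
# Minimal sketch: projecting pre-extracted audio/visual features into the
# LLM's token-embedding space. Dimensions and names are illustrative only.
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=1408, llm_dim=4096):
        super().__init__()
        # One linear adapter per modality, mapping encoder features to the LLM hidden size.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats: (B, T_a, audio_dim), visual_feats: (B, T_v, visual_dim)
        # text_embeds: (B, T_t, llm_dim) from the LLM's embedding table.
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        # Prepend modality tokens to the text prompt tokens for the LLM backbone.
        return torch.cat([audio_tokens, visual_tokens, text_embeds], dim=1)

# Example: batch of 2; the fused sequence would be fed to the instruction-tuned LLaMA.
proj = MultimodalProjector()
fused = proj(torch.randn(2, 8, 1024), torch.randn(2, 16, 1408), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 56, 4096])
```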
Quick Start & Requirements
- Install: Clone the repository, create a conda environment from `environment.yaml`, and activate it.
- Prerequisites: Requires Llama-2-7b-chat-hf model weights, MiniGPT-v2 checkpoint, and HuBERT-large model. Data from MER2023 is needed for training (access via MER2023 website). Pre-extracted features are available via Google Drive.
- Demo: An online demo is available. Local demo setup involves downloading specific checkpoints and installing `moviepy`, `soundfile`, and `opencv-python` (see the preprocessing sketch after this list).
- Links: Online Demo, MER2023 Dataset, Pre-extracted Features
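As referenced in the Demo item above, here is a minimal sketch of the kind of video preprocessing the local demo dependencies imply: splitting the audio track for the audio branch and sampling frames for the visual branches. File paths, the moviepy (<2.0) import path, and the frame-sampling strategy are assumptions, not the repository's exact pipeline.

```python
# Minimal sketch of demo-style preprocessing, assuming moviepy, soundfile,
# and opencv-python are installed; "example.mp4" is a placeholder path.
import cv2
import soundfile as sf
from moviepy.editor import VideoFileClip

video_path = "example.mp4"

# 1) Split the audio track out of the video for the audio (HuBERT) branch.
clip = VideoFileClip(video_path)
clip.audio.write_audiofile("example.wav")
waveform, sample_rate = sf.read("example.wav")

# 2) Read frames for the visual (EVA/MAE/VideoMAE) branches.
cap = cv2.VideoCapture(video_path)
frames = []
ok, frame = cap.read()
while ok:
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    ok, frame = cap.read()
cap.release()

print(len(frames), waveform.shape, sample_rate)
```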
Highlighted Details
- Placed 3rd in the MER-OV track and 1st in the MER-NOISE track of the MER2024 Challenge.
- State-of-the-art performance on EMER dataset (Clue Overlap: 7.83, Label Overlap: 6.25).
- High UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW.
- Utilizes HuBERT (audio), EVA (global visual), MAE (local visual), and VideoMAE (temporal visual) encoders; a feature-extraction sketch follows below.
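As noted in the last item, a minimal sketch of extracting HuBERT audio features with Hugging Face `transformers` is shown below. The checkpoint id and the feature-extractor settings are illustrative stand-ins; the exact HuBERT-large weights used by Emotion-LLaMA may differ, and 16 kHz mono audio is assumed.

```python
# Minimal sketch of HuBERT feature extraction; checkpoint and settings are assumptions.
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

model_id = "facebook/hubert-large-ll60k"  # illustrative; the repo's checkpoint may differ
extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                     padding_value=0.0, do_normalize=True)
model = HubertModel.from_pretrained(model_id)
model.eval()

waveform = torch.randn(16000).numpy()  # 1 second of dummy 16 kHz mono audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(inputs.input_values).last_hidden_state  # (1, frames, 1024)
print(features.shape)
```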
Maintenance & Community
- Accepted at NeurIPS 2024.
- Code is based on MiniGPT-4.
- GitHub
Licensing & Compatibility
- License: BSD 3-Clause License for code. MER2023 data is provided under an EULA for research purposes only.
- Compatibility: Commercial use of the data is restricted.
Limitations & Caveats
- Raw videos and images from MER2023 cannot be directly distributed due to copyright.
- Training requires significant computational resources and careful setup of multiple pre-trained models and feature-extraction pipelines.