Emotion-LLaMA addresses the limitations of single-modality emotion recognition and the challenges in multimodal large language models (MLLMs) for integrating audio and visual cues. It enables nuanced emotion recognition and reasoning by processing audio, visual, and textual inputs, targeting researchers and developers in human-computer interaction, education, and counseling.
How It Works
Emotion-LLaMA integrates audio, visual, and textual data using specialized encoders. Features are aligned into a shared space, and a modified LLaMA model, fine-tuned with instructions, handles the multimodal understanding and reasoning. This approach aims to capture complex emotional expressions more effectively than single-modality methods.
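The following is a minimal sketch (not the official implementation) of the kind of encoder-to-LLM alignment described above: pre-extracted audio and visual features are mapped through per-modality linear adapters into the LLaMA embedding space and prepended to the text prompt tokens. All dimensions, class names, and the fusion strategy are illustrative assumptions.

```python
# Minimal sketch: projecting pre-extracted audio/visual features into the
# LLM's token-embedding space. Dimensions and names are illustrative only.
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=1408, llm_dim=4096):
        super().__init__()
        # One linear adapter per modality, mapping encoder features to the LLM hidden size.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats: (B, T_a, audio_dim), visual_feats: (B, T_v, visual_dim)
        # text_embeds: (B, T_t, llm_dim) from the LLM's embedding table.
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        # Prepend modality tokens to the text prompt tokens for the LLM backbone.
        return torch.cat([audio_tokens, visual_tokens, text_embeds], dim=1)

# Example: batch of 2; the fused sequence would be fed to the instruction-tuned LLaMA.
proj = MultimodalProjector()
fused = proj(torch.randn(2, 8, 1024), torch.randn(2, 16, 1408), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 56, 4096])
```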
Quick Start & Requirements
- Install: Clone the repository, create a conda environment from `environment.yaml`, and activate it.
- Prerequisites: Requires Llama-2-7b-chat-hf model weights, MiniGPT-v2 checkpoint, and HuBERT-large model. Data from MER2023 is needed for training (access via MER2023 website). Pre-extracted features are available via Google Drive.
- Demo: An online demo is available. Local demo setup involves downloading specific checkpoints and installing `moviepy`, `soundfile`, and `opencv-python` (see the preprocessing sketch after this list).
- Links: Online Demo, MER2023 Dataset, Pre-extracted Features
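As referenced in the Demo item above, here is a minimal sketch of the kind of video preprocessing the local demo dependencies imply: splitting the audio track for the audio branch and sampling frames for the visual branches. File paths, the moviepy (<2.0) import path, and the frame-sampling strategy are assumptions, not the repository's exact pipeline.

```python
# Minimal sketch of demo-style preprocessing, assuming moviepy, soundfile,
# and opencv-python are installed; "example.mp4" is a placeholder path.
import cv2
import soundfile as sf
from moviepy.editor import VideoFileClip

video_path = "example.mp4"

# 1) Split the audio track out of the video for the audio (HuBERT) branch.
clip = VideoFileClip(video_path)
clip.audio.write_audiofile("example.wav")
waveform, sample_rate = sf.read("example.wav")

# 2) Read frames for the visual (EVA/MAE/VideoMAE) branches.
cap = cv2.VideoCapture(video_path)
frames = []
ok, frame = cap.read()
while ok:
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    ok, frame = cap.read()
cap.release()

print(len(frames), waveform.shape, sample_rate)
```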
Highlighted Details
- Placed 3rd in the MER-OV track and 1st in the MER-NOISE track of the MER2024 Challenge.
- State-of-the-art performance on EMER dataset (Clue Overlap: 7.83, Label Overlap: 6.25).
- High UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW.
- Utilizes HuBERT (audio), EVA (global visual), MAE (local visual), and VideoMAE (temporal visual) encoders; a feature-extraction sketch follows below.
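As noted in the last item, a minimal sketch of extracting HuBERT audio features with Hugging Face `transformers` is shown below. The checkpoint id and the feature-extractor settings are illustrative stand-ins; the exact HuBERT-large weights used by Emotion-LLaMA may differ, and 16 kHz mono audio is assumed.

```python
# Minimal sketch of HuBERT feature extraction; checkpoint and settings are assumptions.
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

model_id = "facebook/hubert-large-ll60k"  # illustrative; the repo's checkpoint may differ
extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                     padding_value=0.0, do_normalize=True)
model = HubertModel.from_pretrained(model_id)
model.eval()

waveform = torch.randn(16000).numpy()  # 1 second of dummy 16 kHz mono audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(inputs.input_values).last_hidden_state  # (1, frames, 1024)
print(features.shape)
```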
Maintenance & Community
- Accepted at NeurIPS 2024.
- Code is based on MiniGPT-4.
- GitHub
Licensing & Compatibility
- License: BSD 3-Clause License for code. MER2023 data is provided under an EULA for research purposes only.
- Compatibility: Commercial use of the data is restricted.
Limitations & Caveats
- Raw videos and images from MER2023 cannot be directly distributed due to copyright.
- Training requires significant computational resources and careful setup of multiple pre-trained models and feature-extraction pipelines.