R1-Omni by HumanMLLM

Multimodal LLM for explainable emotion recognition using reinforcement learning

Created 8 months ago
964 stars

Top 38.3% on SourcePulse

Project Summary

R1-Omni is an open-source, omni-multimodal large language model specifically designed for emotion recognition. It leverages Reinforcement Learning with Verifiable Reward (RLVR) to enhance reasoning, understanding, and generalization capabilities, particularly in out-of-distribution scenarios. The project targets researchers and developers working on multimodal AI and affective computing.

How It Works

R1-Omni builds upon the HumanOmni-0.5B base model and integrates RLVR for improved emotion recognition. Instead of relying on a learned reward model, RLVR scores the model's outputs with rule-based, automatically verifiable rewards, optimizing its ability to interpret complex emotional cues from both visual and audio data. This methodology is key to the model's performance gains, especially when generalizing to unseen data distributions.
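To make the reward signal concrete, the sketch below shows what a rule-based, verifiable reward for emotion recognition could look like: a correctness term that compares the predicted label against the ground truth, plus a small bonus for respecting a <think>/<answer> output format. The function name, tag format, and weights here are illustrative assumptions, not R1-Omni's actual implementation.

```python
import re

# Sketch of an RLVR-style verifiable reward for emotion recognition.
# Names, tags, and weights are assumptions, not the project's exact code.
def emotion_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 for a correct emotion in the <answer> tag, plus a 0.2
    bonus when the <think>/<answer> format is followed."""
    format_ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                               completion, re.DOTALL))
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    accuracy = 1.0 if predicted == ground_truth.strip().lower() else 0.0
    return accuracy + (0.2 if format_ok else 0.0)

# A well-formatted, correct prediction earns the full reward.
print(emotion_reward("<think>The voice trembles and the face is downcast.</think>"
                     "<answer>sad</answer>", "sad"))  # 1.2
```

Because such a reward can be computed automatically from the label alone, no learned reward model is required, which makes the approach practical for emotion datasets with categorical ground truth.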

Quick Start & Requirements

  • Install: Follow the installation instructions in the R1-V repository.
  • Prerequisites: NVIDIA driver 535.54, PyTorch 2.5.1+cu124, torchvision 0.20.1+cu124, torchaudio 2.5.1+cu124, transformers 4.49.0, flash_attn 2.7.4.
  • Inference Dependencies: Download the siglip-224, whisper-large-v3, and bert-base-uncased models, then update the configuration files (config.json, inference.py) with their local paths; see the sketch after this list.
  • Inference Command: python inference.py --modal video_audio --model_path ./R1-Omni-0.5B --video_path video.mp4 --instruct "..."
  • Resources: Requires a CUDA 12.4 environment (matching the cu124 wheels above) and the pre-trained encoder models. Setup time depends on model downloads and configuration.
  • Demo: https://github.com/user-attachments/assets/8c73cbe6-5f24-49a9-bef9-bff6c50e4580
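The snippet below is a minimal sketch of the path update from the Inference Dependencies step. The config keys mm_vision_tower and mm_audio_tower are assumptions; check the released config.json for the actual field names, and note that the bert-base-uncased path referenced in inference.py must be updated separately.

```python
import json

# Point the checkpoint's config.json at locally downloaded encoders.
# Key names below are assumed; verify them against the released config.json.
config_path = "./R1-Omni-0.5B/config.json"
with open(config_path) as f:
    config = json.load(f)

config["mm_vision_tower"] = "/local/models/siglip-224"       # vision encoder (assumed key)
config["mm_audio_tower"] = "/local/models/whisper-large-v3"  # audio encoder (assumed key)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

With the paths in place, inference runs via the command shown above (python inference.py --modal video_audio ...).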

Highlighted Details

  • Achieves state-of-the-art performance on emotion recognition benchmarks, including significant improvements on out-of-distribution datasets like RAVDESS.
  • Demonstrates superior reasoning and generalization capabilities compared to standard Supervised Fine-Tuning (SFT) methods.
  • Open-sources multiple model checkpoints: base, cold-start (EMER-SFT), fine-tuned (MAFW-DFEW-SFT), and the final R1-Omni model.
  • Featured in People's Daily and Bloomberg.

Maintenance & Community

  • The project is actively being updated, with planned releases covering environment setup, integration of the HumanOmni source code, and a more detailed reproduction process.
  • Related work includes R1-V, HumanOmni, and DeepSeek-R1.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The project is still under active development, with several items on its to-do list, including open-sourcing all training data and providing a more detailed reproduction process.
  • The current inference setup requires manual configuration of local model paths.
Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days
