R1-Omni by HumanMLLM

Multimodal LLM for explainable emotion recognition using reinforcement learning

Created 8 months ago
964 stars

Top 38.3% on SourcePulse

Project Summary

R1-Omni is an open-source, omni-multimodal large language model specifically designed for emotion recognition. It leverages Reinforcement Learning with Verifiable Reward (RLVR) to enhance reasoning, understanding, and generalization capabilities, particularly in out-of-distribution scenarios. The project targets researchers and developers working on multimodal AI and affective computing.

How It Works

R1-Omni builds upon the HumanOmni-0.5B base model and integrates RLVR for improved emotion recognition. Instead of relying on a learned reward model, RLVR scores the model's outputs with rule-based, automatically verifiable rewards, optimizing its ability to interpret complex emotional cues from both visual and audio data. This methodology is key to the model's performance gains, especially when generalizing to unseen data distributions.
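To make the reward signal concrete, the sketch below shows what a rule-based, verifiable reward for emotion recognition could look like: a correctness term that compares the predicted label against the ground truth, plus a small bonus for respecting a <think>/<answer> output format. The function name, tag format, and weights here are illustrative assumptions, not R1-Omni's actual implementation.

```python
import re

# Sketch of an RLVR-style verifiable reward for emotion recognition.
# Names, tags, and weights are assumptions, not the project's exact code.
def emotion_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 for a correct emotion in the <answer> tag, plus a 0.2
    bonus when the <think>/<answer> format is followed."""
    format_ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                               completion, re.DOTALL))
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    accuracy = 1.0 if predicted == ground_truth.strip().lower() else 0.0
    return accuracy + (0.2 if format_ok else 0.0)

# A well-formatted, correct prediction earns the full reward.
print(emotion_reward("<think>The voice trembles and the face is downcast.</think>"
                     "<answer>sad</answer>", "sad"))  # 1.2
```

Because such a reward can be computed automatically from the label alone, no learned reward model is required, which makes the approach practical for emotion datasets with categorical ground truth.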

Quick Start & Requirements

  • Install: Follow the installation instructions in the R1-V repository.
  • Prerequisites: NVIDIA driver 535.54, PyTorch 2.5.1+cu124, torchvision 0.20.1+cu124, torchaudio 2.5.1+cu124, transformers 4.49.0, flash_attn 2.7.4.
  • Inference Dependencies: Download the siglip-224, whisper-large-v3, and bert-base-uncased models, then update the configuration files (config.json, inference.py) with their local paths; see the sketch after this list.
  • Inference Command: python inference.py --modal video_audio --model_path ./R1-Omni-0.5B --video_path video.mp4 --instruct "..."
  • Resources: Requires a CUDA 12.4 environment (matching the cu124 wheels above) and the pre-trained encoder models. Setup time depends on model downloads and configuration.
  • Demo: https://github.com/user-attachments/assets/8c73cbe6-5f24-49a9-bef9-bff6c50e4580
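The snippet below is a minimal sketch of the path update from the Inference Dependencies step. The config keys mm_vision_tower and mm_audio_tower are assumptions; check the released config.json for the actual field names, and note that the bert-base-uncased path referenced in inference.py must be updated separately.

```python
import json

# Point the checkpoint's config.json at locally downloaded encoders.
# Key names below are assumed; verify them against the released config.json.
config_path = "./R1-Omni-0.5B/config.json"
with open(config_path) as f:
    config = json.load(f)

config["mm_vision_tower"] = "/local/models/siglip-224"       # vision encoder (assumed key)
config["mm_audio_tower"] = "/local/models/whisper-large-v3"  # audio encoder (assumed key)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

With the paths in place, inference runs via the command shown above (python inference.py --modal video_audio ...).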

Highlighted Details

  • Achieves state-of-the-art performance on emotion recognition benchmarks, including significant improvements on out-of-distribution datasets like RAVDESS.
  • Demonstrates superior reasoning and generalization capabilities compared to standard Supervised Fine-Tuning (SFT) methods.
  • Open-sources multiple model checkpoints: base, cold-start (EMER-SFT), fine-tuned (MAFW-DFEW-SFT), and the final R1-Omni model.
  • Featured in People's Daily and Bloomberg.

Maintenance & Community

  • The project is actively being updated, with planned releases covering environment setup, integration of the HumanOmni source code, and a more detailed reproduction process.
  • Related work includes R1-V, HumanOmni, and DeepSeek-R1.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The project is still under active development, with several items on its to-do list, including open-sourcing all training data and providing a more detailed reproduction process.
  • The current inference setup requires manual configuration of local model paths.
Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days
