RynnEC by alibaba-damo-academy

Video MLLM for embodied cognition

Created 1 month ago
354 stars

Top 78.8% on SourcePulse

View on GitHub
Project Summary

RynnEC is a video multi-modal large language model (MLLM) designed for embodied cognition tasks, enabling machines to understand and interact with the physical world through video. It targets researchers and developers working on AI agents, robotics, and embodied AI, offering enhanced capabilities in object and spatial understanding from video input.

How It Works

RynnEC integrates a large language model with visual encoders to process video data. Its architecture is built on the Qwen2.5 foundation model, extended with specialized visual components. This allows RynnEC to perform tasks such as object recognition, spatial reasoning, and video object segmentation, interpreting visual information directly within a conversational context. The design emphasizes understanding of egocentric video, which is crucial for embodied agents.
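
The snippet below is a minimal sketch of how such a model might be queried. It assumes the checkpoints load through the standard Hugging Face transformers interface with trust_remote_code; the model ID, processor arguments, and prompt wording are illustrative guesses rather than the repository's actual API (the bundled example notebooks show the real usage).

```python
# Hypothetical inference sketch -- model ID, processor arguments, and prompt
# format are assumptions; consult the repository's example notebooks for the
# actual API.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Alibaba-DAMO-Academy/RynnEC-2B"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit on a single GPU
    device_map="auto",
    trust_remote_code=True,
)

# An egocentric clip plus a spatial-reasoning question, phrased conversationally.
inputs = processor(
    videos="kitchen_walkthrough.mp4",  # hypothetical local video file
    text="Which object is closest to the camera, and roughly how far away is it?",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```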

Quick Start & Requirements

  • Installation: Clone the repository, navigate to the directory, and install dependencies using pip install -e . followed by pip install flash-attn --no-build-isolation.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4.0, CUDA >= 11.8, transformers >= 4.46.3 (a quick environment check is sketched after this list).
  • Resources: Specific hardware requirements are not documented, but the CUDA prerequisite implies a GPU for both training and inference.
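
As a quick sanity check of the prerequisites above, the short script below verifies the Python, PyTorch, transformers, and CUDA requirements; it only inspects the local environment and makes no assumptions about RynnEC itself.

```python
# Verify the prerequisites listed above before installing RynnEC.
import sys

import torch
import transformers
from packaging import version  # installed alongside transformers

assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
assert version.parse(torch.__version__.split("+")[0]) >= version.parse("2.4.0"), \
    "PyTorch >= 2.4.0 is required"
assert version.parse(transformers.__version__) >= version.parse("4.46.3"), \
    "transformers >= 4.46.3 is required"
assert torch.cuda.is_available(), "a CUDA-capable GPU (CUDA >= 11.8) is expected"

print("CUDA runtime:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
```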

Highlighted Details

  • Offers RynnEC-2B and RynnEC-7B models based on Qwen2.5.
  • Includes a comprehensive benchmark suite (RynnEC-Bench) evaluating 22 embodied cognitive abilities across object and spatial cognition.
  • Provides example notebooks for object understanding, spatial understanding, and video object segmentation.

Maintenance & Community

The project is developed by Alibaba DAMO Academy. Further community engagement details (e.g., Discord, Slack) are not explicitly provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: The service is intended for non-commercial use only. It is subject to the model licenses of Qwen, OpenAI, and Gemini, as well as ShareGPT's privacy practices; commercial use may require separate licensing or explicit compliance with those terms.

Limitations & Caveats

The project's terms of use explicitly restrict it to non-commercial applications due to dependencies on other models and data sources. Specific performance benchmarks or comparisons against other MLLMs are not detailed in the README.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5

Star History

287 stars in the last 30 days
