OneThinker by tulerfeng

An all-in-one model for multimodal reasoning across image and video

Created 1 month ago
359 stars

Top 78.1% on SourcePulse

View on GitHub
Project Summary

OneThinker is an all-in-one multimodal reasoning generalist designed for image and video analysis. It targets researchers and engineers needing a unified model for diverse visual tasks, offering cross-task knowledge transfer and zero-shot generalization benefits.

How It Works

This project introduces OneThinker, a unified multimodal reasoning model built upon Qwen3-VL-8B. It leverages a large-scale OneThinker-600k multi-task corpus and a high-quality OneThinker-SFT-340k dataset with Chain-of-Thought (CoT) annotations. A novel EMA-GRPO reinforcement learning method balances heterogeneous reward signals across tasks, enabling effective cross-task and cross-modality knowledge transfer.
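
The summary above names EMA-GRPO but does not spell out the update rule, so the following is a minimal Python sketch of one plausible reading: exponential-moving-average statistics tracked per task rescale heterogeneous rewards onto a comparable scale before the usual group-relative (GRPO) advantage is computed. The names EMARewardNormalizer and grpo_advantages are illustrative, not the project's API; see arXiv:2512.03043 for the actual formulation.

    # Illustrative sketch only; not the paper's exact EMA-GRPO algorithm.
    from collections import defaultdict
    from typing import Dict, List

    class EMARewardNormalizer:
        """Tracks an exponential moving average of reward mean/variance per task,
        so rewards from heterogeneous tasks land on a comparable scale."""

        def __init__(self, decay: float = 0.99):
            self.decay = decay
            self.mean: Dict[str, float] = defaultdict(float)
            self.var: Dict[str, float] = defaultdict(lambda: 1.0)

        def update(self, task: str, rewards: List[float]) -> None:
            # Fold the current batch's statistics into the per-task EMA.
            m = sum(rewards) / len(rewards)
            v = sum((r - m) ** 2 for r in rewards) / len(rewards)
            d = self.decay
            self.mean[task] = d * self.mean[task] + (1 - d) * m
            self.var[task] = d * self.var[task] + (1 - d) * v

        def normalize(self, task: str, reward: float, eps: float = 1e-8) -> float:
            # Rescale a raw reward by the task's running statistics.
            return (reward - self.mean[task]) / ((self.var[task] + eps) ** 0.5)

    def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
        """Standard GRPO step: advantage = (r - group mean) / group std,
        computed within the group of responses sampled for one prompt."""
        m = sum(rewards) / len(rewards)
        s = (sum((r - m) ** 2 for r in rewards) / len(rewards)) ** 0.5
        return [(r - m) / (s + eps) for r in rewards]

    # Usage: rescale one task's group of rewards, then compute group advantages.
    norm = EMARewardNormalizer(decay=0.99)
    group = [0.9, 0.4, 0.7, 0.2]
    norm.update("grounding", group)
    advantages = grpo_advantages([norm.normalize("grounding", r) for r in group])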

Quick Start & Requirements

  • Primary install / run command: Setup uses two Conda environments, both on Python 3.11: llamafactory, installed with pip install -e ".[torch,metrics]", and easyr1, installed with pip install -e . (see the sketch after this list). Download the required datasets afterward.
  • Non-default prerequisites: Python 3.11, PyTorch, and a minimum of 8× 80 GB GPUs for training.
  • Estimated setup time or resource footprint: training requires a substantial GPU footprint (8× 80 GB GPUs).
  • Links: Models, training data, and evaluation scripts are available via Hugging Face links provided in the repository. The paper is cited as arXiv preprint arXiv:2512.03043.
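
Assembled from the bullets above, the two-environment setup looks roughly like the following. The conda create flags and the working directory for each pip install (the SFT codebase for llamafactory, the RL codebase for easyr1) are assumptions; defer to the repository README for the exact steps.

    # Rough sketch of the SFT/RL environment setup; directories are assumptions.
    conda create -n llamafactory python=3.11 -y
    conda activate llamafactory
    pip install -e ".[torch,metrics]"   # run inside the SFT training codebase

    conda create -n easyr1 python=3.11 -y
    conda activate easyr1
    pip install -e .                    # run inside the RL training codebase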

Highlighted Details

  • OneThinker achieves strong performance across 31 benchmarks spanning 10 fundamental vision tasks, including 70.6% on MMMU and 93.7 on RefCOCO-testA.
  • Demonstrates beneficial cross-task and cross-modality knowledge transfer, alongside promising zero-shot generalization capabilities within its unified training framework.
  • Supports image-video mixed training and provides a full pipeline from dataset preparation to evaluation.

Maintenance & Community

  • The project is associated with authors listed in the arXiv preprint (Feng et al., 2025). No specific community channels (e.g., Discord, Slack) or roadmap are detailed.

Licensing & Compatibility

  • License type and compatibility notes for commercial use are not specified in the provided README.

Limitations & Caveats

  • Training demands significant hardware (a minimum of 8× 80 GB GPUs).
  • Segmentation evaluation requires a separate installation of sam2, and some QA evaluations depend on VLMEvalKit.
  • The project mandates Python 3.11 and distinct Conda environment setups for SFT and RL.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 70 stars in the last 30 days
