OneThinker by tulerfeng

An all-in-one model for multimodal reasoning across image and video

Created 1 month ago
359 stars

Top 78.1% on SourcePulse

View on GitHub
Project Summary

OneThinker is an all-in-one multimodal reasoning generalist designed for image and video analysis. It targets researchers and engineers needing a unified model for diverse visual tasks, offering cross-task knowledge transfer and zero-shot generalization benefits.

How It Works

This project introduces OneThinker, a unified multimodal reasoning model built upon Qwen3-VL-8B. It leverages a large-scale OneThinker-600k multi-task corpus and a high-quality OneThinker-SFT-340k dataset with Chain-of-Thought (CoT) annotations. A novel EMA-GRPO reinforcement learning method balances heterogeneous reward signals across tasks, enabling effective cross-task and cross-modality knowledge transfer.
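
The summary above names EMA-GRPO but does not spell out the update rule, so the following is a minimal Python sketch of one plausible reading: exponential-moving-average statistics tracked per task rescale heterogeneous rewards onto a comparable scale before the usual group-relative (GRPO) advantage is computed. The names EMARewardNormalizer and grpo_advantages are illustrative, not the project's API; see arXiv:2512.03043 for the actual formulation.

    # Illustrative sketch only; not the paper's exact EMA-GRPO algorithm.
    from collections import defaultdict
    from typing import Dict, List

    class EMARewardNormalizer:
        """Tracks an exponential moving average of reward mean/variance per task,
        so rewards from heterogeneous tasks land on a comparable scale."""

        def __init__(self, decay: float = 0.99):
            self.decay = decay
            self.mean: Dict[str, float] = defaultdict(float)
            self.var: Dict[str, float] = defaultdict(lambda: 1.0)

        def update(self, task: str, rewards: List[float]) -> None:
            # Fold the current batch's statistics into the per-task EMA.
            m = sum(rewards) / len(rewards)
            v = sum((r - m) ** 2 for r in rewards) / len(rewards)
            d = self.decay
            self.mean[task] = d * self.mean[task] + (1 - d) * m
            self.var[task] = d * self.var[task] + (1 - d) * v

        def normalize(self, task: str, reward: float, eps: float = 1e-8) -> float:
            # Rescale a raw reward by the task's running statistics.
            return (reward - self.mean[task]) / ((self.var[task] + eps) ** 0.5)

    def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
        """Standard GRPO step: advantage = (r - group mean) / group std,
        computed within the group of responses sampled for one prompt."""
        m = sum(rewards) / len(rewards)
        s = (sum((r - m) ** 2 for r in rewards) / len(rewards)) ** 0.5
        return [(r - m) / (s + eps) for r in rewards]

    # Usage: rescale one task's group of rewards, then compute group advantages.
    norm = EMARewardNormalizer(decay=0.99)
    group = [0.9, 0.4, 0.7, 0.2]
    norm.update("grounding", group)
    advantages = grpo_advantages([norm.normalize("grounding", r) for r in group])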

Quick Start & Requirements

  • Primary install / run command: Setup uses two Conda environments, both on Python 3.11: llamafactory, installed with pip install -e ".[torch,metrics]", and easyr1, installed with pip install -e . (see the sketch after this list). Download the required datasets afterward.
  • Non-default prerequisites: Python 3.11, PyTorch, and a minimum of 8× 80 GB GPUs for training.
  • Estimated setup time or resource footprint: training requires a substantial GPU footprint (8× 80 GB GPUs).
  • Links: Models, training data, and evaluation scripts are available via Hugging Face links provided in the repository. The paper is cited as arXiv preprint arXiv:2512.03043.
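
Assembled from the bullets above, the two-environment setup looks roughly like the following. The conda create flags and the working directory for each pip install (the SFT codebase for llamafactory, the RL codebase for easyr1) are assumptions; defer to the repository README for the exact steps.

    # Rough sketch of the SFT/RL environment setup; directories are assumptions.
    conda create -n llamafactory python=3.11 -y
    conda activate llamafactory
    pip install -e ".[torch,metrics]"   # run inside the SFT training codebase

    conda create -n easyr1 python=3.11 -y
    conda activate easyr1
    pip install -e .                    # run inside the RL training codebase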

Highlighted Details

  • OneThinker achieves strong performance across 31 benchmarks spanning 10 fundamental vision tasks, including 70.6% on MMMU and 93.7 on RefCOCO-testA.
  • Demonstrates beneficial cross-task and cross-modality knowledge transfer, alongside promising zero-shot generalization capabilities within its unified training framework.
  • Supports image-video mixed training and provides a full pipeline from dataset preparation to evaluation.

Maintenance & Community

  • The project is associated with authors listed in the arXiv preprint (Feng et al., 2025). No specific community channels (e.g., Discord, Slack) or roadmap are detailed.

Licensing & Compatibility

  • License type and compatibility notes for commercial use are not specified in the provided README.

Limitations & Caveats

  • Training demands significant hardware (a minimum of 8× 80 GB GPUs).
  • Segmentation evaluation requires a separate installation of sam2, and some QA evaluations depend on VLMEvalKit.
  • The project mandates Python 3.11 and distinct Conda environment setups for SFT and RL.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 70 stars in the last 30 days
