R1-Onevision by Fancy-MLLM

Multimodal LLM for visual reasoning tasks

created 5 months ago
549 stars

Top 59.0% on sourcepulse

Project Summary

R1-Onevision is a multimodal reasoning large language model designed to tackle complex visual reasoning tasks by integrating visual and textual data. It aims to provide precise interpretations of problems in domains such as mathematics, science, and logical reasoning, serving as an AI assistant for problem-solving.

How It Works

The model employs a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling language-based reasoning. The pipeline is built on the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations. The model is then trained with supervised fine-tuning and reinforcement learning to strengthen its reasoning and generalization abilities.
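
The pipeline can be pictured as two stages: first transcribe the image into a formal textual representation (OCR text, symbols, chart values, spatial relations), then reason over that text step by step. The sketch below is a conceptual illustration only, not the project's actual code; vlm_generate is a hypothetical callable standing in for whatever multimodal backend is used.

    from typing import Callable

    def solve_visual_problem(image_path: str,
                             question: str,
                             vlm_generate: Callable[[str, str], str]) -> str:
        # Stage 1: formalize the image so reasoning can happen in language space.
        formal_description = vlm_generate(
            image_path,
            "Transcribe this image into a precise, formal textual representation: "
            "all text, numbers, symbols, diagrams, and their relationships.",
        )

        # Stage 2: deep chain-of-thought reasoning over the textual form.
        return vlm_generate(
            image_path,
            f"Problem: {question}\n"
            f"Formal description of the image:\n{formal_description}\n"
            "Reason step by step, then state the final answer.",
        )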

Quick Start & Requirements
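
The summary does not list setup steps. Since the model is fine-tuned from Qwen2.5-VL, inference most likely follows the standard Qwen2.5-VL Hugging Face transformers workflow. A minimal sketch, assuming the checkpoint is published on Hugging Face under an identifier like Fancy-MLLM/R1-Onevision-7B (the exact repo id is an assumption) and that a recent transformers release plus the qwen-vl-utils helper package are installed:

    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"  # assumed repo id; check the project page

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # A single image plus a question, formatted as a chat message.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "problem.png"},
            {"type": "text", "text": "Solve the problem in the image step by step."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)

    # A generous max_new_tokens leaves room for the chain-of-thought trace.
    output_ids = model.generate(**inputs, max_new_tokens=2048)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])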

Highlighted Details

  • Fine-tuned from Qwen2.5-VL on the R1-Onevision dataset.
  • The R1-Onevision-Bench benchmark is aligned with human educational stages.
  • Dataset includes diverse domains: natural scenes, science, math, OCR, charts.
  • Supports deep chain-of-thought (CoT) reasoning.

Maintenance & Community

  • Actively updated with new versions of the dataset, models, and benchmark.
  • Developed by Zhejiang University.
  • Open to ideas and contributions.

Licensing & Compatibility

  • The README does not explicitly state the license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a research artifact with recent releases, suggesting it may still be in an experimental or evolving stage. Specific limitations or unsupported features are not detailed in the README.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 39 stars in the last 90 days
