R1-Onevision by Fancy-MLLM

Multimodal LLM for visual reasoning tasks

Created 7 months ago · 566 stars · Top 56.8% on SourcePulse

Project Summary

R1-Onevision is a multimodal reasoning large language model designed to tackle complex visual reasoning tasks by integrating visual and textual inputs. It aims to produce precise, step-by-step solutions in domains such as mathematics, science, and logical reasoning, serving as an AI assistant for problem-solving.

How It Works

The model employs a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling language-based reasoning. This approach is facilitated by the R1-Onevision dataset, which contains detailed, step-by-step multimodal reasoning annotations. The model is further developed through supervised fine-tuning and reinforcement learning to enhance reasoning and generalization abilities.
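
The project's actual pipeline is not reproduced in this summary; the sketch below only illustrates the two stages described above, with `vision_model` and `language_model` as hypothetical stand-ins for the underlying models:

```python
# Illustrative two-stage sketch of the cross-modal reasoning pipeline
# (not the project's actual implementation).

def formalize_image(image_path: str, vision_model) -> str:
    """Stage 1: transcribe the image into a formal textual representation
    (dense caption, OCR text, formulas, chart values)."""
    return vision_model(
        image_path,
        prompt="Transcribe this image into precise text, including any "
               "written text, formulas, and chart or table values.",
    )

def answer_question(question: str, image_path: str,
                    vision_model, language_model) -> str:
    """Stage 2: language-based, step-by-step reasoning over the formal text."""
    description = formalize_image(image_path, vision_model)
    prompt = (
        f"Image (formal description):\n{description}\n\n"
        f"Question: {question}\n"
        "Reason step by step, then state the final answer."
    )
    return language_model(prompt)
```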

Quick Start & Requirements

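This summary does not reproduce the project's own setup instructions, so treat the following as a minimal inference sketch: it assumes the checkpoint is published on the Hugging Face Hub (the repo ID below is a guess; verify it on the project page) and, since the model is fine-tuned from Qwen2.5-VL, that it loads with the standard Qwen2.5-VL classes in a recent transformers release.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"  # assumed repo ID; verify on the Hub

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/problem.png"},
        {"type": "text", "text": "Solve the problem in the image step by step."},
    ],
}]

# Standard Qwen2.5-VL preprocessing: render the chat template, then batch
# the text together with the extracted image tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
response = processor.batch_decode(
    [ids[inputs.input_ids.shape[1]:] for ids in output_ids],
    skip_special_tokens=True,
)[0]
print(response)
```
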
Highlighted Details

  • Fine-tuned from Qwen2.5-VL on the R1-Onevision dataset.
  • The R1-Onevision-Bench benchmark is aligned with human educational stages.
  • The dataset spans diverse domains: natural scenes, science, math, OCR, and charts.
  • Supports deep chain-of-thought (CoT) reasoning; see the parsing sketch after this list.
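
R1-style reasoning models typically wrap the chain of thought in explicit delimiters; assuming R1-Onevision follows the common `<think>...</think>` convention (an assumption; check the model card for its actual output format), a small helper can separate the trace from the final answer:

```python
import re

def split_cot(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer), assuming the chain of
    thought is wrapped in <think>...</think> tags (an assumption; verify
    against R1-Onevision's actual output format)."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()  # no explicit trace found
    return match.group(1).strip(), response[match.end():].strip()
```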

Maintenance & Community

  • The project has been updated over time with new versions of the dataset, models, and benchmark.
  • Developed by Zhejiang University.
  • Open to ideas and contributions.

Licensing & Compatibility

  • The README does not explicitly state the license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a research artifact and appears to still be in an experimental, evolving stage. The README does not detail specific limitations or unsupported features.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

gill by kohjingyu

Multimodal LLM for generating/retrieving images and generating text
463 stars · Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

DeepSeek-VL2 by deepseek-ai

MoE vision-language model for multimodal understanding
5k stars · Created 9 months ago · Updated 6 months ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).