Step3-VL-10B by stepfun-ai

Frontier multimodal intelligence in a compact 10B model

Created 1 month ago
398 stars

Top 72.7% on SourcePulse

View on GitHub
Project Summary

Step3-VL-10B is a 10B-parameter open-source multimodal foundation model designed for high efficiency and frontier-level intelligence. It achieves state-of-the-art performance for its size, matching or surpassing models 10-20x larger, making it ideal for researchers and developers seeking powerful multimodal capabilities in a compact footprint.

How It Works

The model employs a unified, fully unfrozen pre-training strategy on a 1.2T-token multimodal corpus, integrating a language-aligned Perception Encoder (PE-lang, 1.8B parameters) with a Qwen3-8B decoder to establish intrinsic vision-language synergy. Frontier capabilities are unlocked through a post-training pipeline of two-stage supervised fine-tuning followed by more than 1,400 reinforcement learning iterations. Crucially, the model implements Parallel Coordinated Reasoning (PaCoRe), which spends additional test-time compute to aggregate evidence from parallel visual explorations, significantly boosting performance on complex reasoning tasks.
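To make the PaCoRe idea concrete, here is a minimal sketch of aggregating parallel rollouts. This is an assumption-laden simplification: it reduces "evidence aggregation" to a majority vote over each rollout's final answer, whereas the actual PaCoRe pipeline coordinates reasoning across the visual explorations themselves. The function name `pacore_aggregate` is illustrative, not from the repository.

```python
from collections import Counter

def pacore_aggregate(rollout_answers):
    """Pick a final answer from parallel rollouts by majority vote.

    Simplified stand-in for PaCoRe's evidence aggregation: the real
    method coordinates reasoning across rollouts rather than merely
    voting on their final answers.
    """
    counts = Counter(rollout_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. 16 parallel rollouts on a visual math problem
rollouts = ["42"] * 11 + ["41"] * 3 + ["43"] * 2
print(pacore_aggregate(rollouts))  # → 42
```

The design intuition is that independent visual explorations make partially uncorrelated errors, so combining them at test time trades extra compute for accuracy.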

Quick Start & Requirements

Inference is supported via the Hugging Face Transformers library. Recommended development environment: python=3.10, torch>=2.1.0, and transformers=4.57.0. Currently only bf16 inference is supported, and multi-patch image preprocessing is enabled by default.
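The multi-patch preprocessing mentioned above tiles large images into several crops before encoding. The sketch below illustrates the general idea with a hypothetical tiling scheme; the patch size, patch budget, and the `plan_patches` helper are assumptions for illustration and may differ from the model's actual preprocessing.

```python
import math

def plan_patches(width, height, patch_size=448, max_patches=9):
    """Plan a (cols, rows) grid for tiling an image into patches.

    Illustrative only: starts from the grid needed to cover the image
    at roughly patch_size pixels per tile, then shrinks the larger
    dimension until the tile count fits within max_patches.
    """
    cols = max(1, math.ceil(width / patch_size))
    rows = max(1, math.ceil(height / patch_size))
    while cols * rows > max_patches:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

# A 1080p screenshot would be covered by a 3x3 grid under this budget
print(plan_patches(1920, 1080))  # → (3, 3)
```

Tiling like this preserves fine detail (useful for OCR and GUI grounding) at the cost of encoding more visual tokens per image.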

Highlighted Details

  • Achieves SOTA performance for its 10B parameter class, outperforming models 10-20x larger on benchmarks like MMMU (80.11%), MathVista (85.50%), and OCRBench (89.00%).
  • Demonstrates exceptional STEM Reasoning (AIME 2025: 94.43% with PaCoRe) and Visual Perception (MMBench: 92.38%).
  • Excels in GUI Grounding (ScreenSpot-V2: 92.61%) and OCR tasks (OCRBench: 89.00%), optimized for agentic applications.
  • The PaCoRe inference mode leverages 16 parallel rollouts and a 128K token context length for enhanced reasoning capabilities.
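The 16-rollout and 128K-context figures above imply a substantial worst-case token budget at test time. A quick back-of-envelope calculation, under the assumption that every rollout could fill its full context window (an upper bound, not typical usage):

```python
def pacore_token_budget(num_rollouts=16, context_len=128 * 1024):
    """Worst-case tokens processed per query if every parallel
    rollout filled its entire context window (upper-bound assumption)."""
    return num_rollouts * context_len

print(pacore_token_budget())  # → 2097152
```

That is up to ~2M tokens per query, which is why PaCoRe mode is noted below as significantly more expensive than single-pass inference.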

Maintenance & Community

Community support is available via a WeChat group for technical discussions and updates.

Licensing & Compatibility

Licensed under Apache 2.0, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Inference currently supports bf16 data type exclusively. The PaCoRe inference mode requires significantly more computational resources at test time due to parallel rollouts.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 59 stars in the last 30 days
