Step3-VL-10B by stepfun-ai

Frontier multimodal intelligence in a compact 10B model

Created 1 month ago
398 stars

Top 72.7% on SourcePulse

View on GitHub
Project Summary

Step3-VL-10B is a 10B-parameter open-source multimodal foundation model designed for high efficiency and frontier-level intelligence. It achieves state-of-the-art performance for its size, matching or surpassing models 10-20x larger, making it ideal for researchers and developers seeking powerful multimodal capabilities in a compact footprint.

How It Works

The model employs a unified, fully unfrozen pre-training strategy on a 1.2T-token multimodal corpus, integrating a language-aligned Perception Encoder (PE-lang, 1.8B parameters) with a Qwen3-8B decoder to establish intrinsic vision-language synergy. Frontier capabilities are unlocked through a post-training pipeline of two-stage supervised fine-tuning followed by more than 1,400 reinforcement learning iterations. Crucially, the model implements Parallel Coordinated Reasoning (PaCoRe), which spends additional test-time compute to aggregate evidence from parallel visual explorations, significantly boosting performance on complex reasoning tasks.
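To make the PaCoRe idea concrete, here is a minimal sketch of aggregating parallel rollouts. This is an assumption-laden simplification: it reduces "evidence aggregation" to a majority vote over each rollout's final answer, whereas the actual PaCoRe pipeline coordinates reasoning across the visual explorations themselves. The function name `pacore_aggregate` is illustrative, not from the repository.

```python
from collections import Counter

def pacore_aggregate(rollout_answers):
    """Pick a final answer from parallel rollouts by majority vote.

    Simplified stand-in for PaCoRe's evidence aggregation: the real
    method coordinates reasoning across rollouts rather than merely
    voting on their final answers.
    """
    counts = Counter(rollout_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. 16 parallel rollouts on a visual math problem
rollouts = ["42"] * 11 + ["41"] * 3 + ["43"] * 2
print(pacore_aggregate(rollouts))  # → 42
```

The design intuition is that independent visual explorations make partially uncorrelated errors, so combining them at test time trades extra compute for accuracy.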

Quick Start & Requirements

Inference is supported via the Hugging Face Transformers library. Recommended development environment: python=3.10, torch>=2.1.0, and transformers=4.57.0. Currently only bf16 inference is supported, and multi-patch image preprocessing is enabled by default.
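The multi-patch preprocessing mentioned above tiles large images into several crops before encoding. The sketch below illustrates the general idea with a hypothetical tiling scheme; the patch size, patch budget, and the `plan_patches` helper are assumptions for illustration and may differ from the model's actual preprocessing.

```python
import math

def plan_patches(width, height, patch_size=448, max_patches=9):
    """Plan a (cols, rows) grid for tiling an image into patches.

    Illustrative only: starts from the grid needed to cover the image
    at roughly patch_size pixels per tile, then shrinks the larger
    dimension until the tile count fits within max_patches.
    """
    cols = max(1, math.ceil(width / patch_size))
    rows = max(1, math.ceil(height / patch_size))
    while cols * rows > max_patches:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

# A 1080p screenshot would be covered by a 3x3 grid under this budget
print(plan_patches(1920, 1080))  # → (3, 3)
```

Tiling like this preserves fine detail (useful for OCR and GUI grounding) at the cost of encoding more visual tokens per image.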

Highlighted Details

  • Achieves SOTA performance for its 10B parameter class, outperforming models 10-20x larger on benchmarks like MMMU (80.11%), MathVista (85.50%), and OCRBench (89.00%).
  • Demonstrates exceptional STEM Reasoning (AIME 2025: 94.43% with PaCoRe) and Visual Perception (MMBench: 92.38%).
  • Excels in GUI Grounding (ScreenSpot-V2: 92.61%) and OCR tasks (OCRBench: 89.00%), optimized for agentic applications.
  • The PaCoRe inference mode leverages 16 parallel rollouts and a 128K token context length for enhanced reasoning capabilities.
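The 16-rollout and 128K-context figures above imply a substantial worst-case token budget at test time. A quick back-of-envelope calculation, under the assumption that every rollout could fill its full context window (an upper bound, not typical usage):

```python
def pacore_token_budget(num_rollouts=16, context_len=128 * 1024):
    """Worst-case tokens processed per query if every parallel
    rollout filled its entire context window (upper-bound assumption)."""
    return num_rollouts * context_len

print(pacore_token_budget())  # → 2097152
```

That is up to ~2M tokens per query, which is why PaCoRe mode is noted below as significantly more expensive than single-pass inference.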

Maintenance & Community

Community support is available via a WeChat group for technical discussions and updates.

Licensing & Compatibility

Licensed under Apache 2.0, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Inference currently supports bf16 data type exclusively. The PaCoRe inference mode requires significantly more computational resources at test time due to parallel rollouts.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 59 stars in the last 30 days
