stepfun-ai: Frontier multimodal intelligence in a compact 10B model
STEP3-VL-10B is a 10B parameter open-source multimodal foundation model designed for high efficiency and frontier-level intelligence. It achieves state-of-the-art performance, matching or surpassing models 10-20x its size, making it ideal for researchers and developers seeking powerful multimodal capabilities in a compact footprint.
How It Works
The model employs a unified, fully unfrozen pre-training strategy on a 1.2T token multimodal corpus, integrating a language-aligned Perception Encoder (PE-lang, 1.8B parameters) with a Qwen3-8B decoder to establish intrinsic vision-language synergy. Frontier capabilities are unlocked via a rigorous post-training pipeline including two-stage supervised fine-tuning and over 1,400 reinforcement learning iterations. Crucially, it implements Parallel Coordinated Reasoning (PaCoRe), which allocates test-time compute to aggregate evidence from parallel visual explorations, significantly boosting complex reasoning performance.
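The text above does not spell out PaCoRe's exact aggregation rule, only that it pools evidence from parallel visual explorations at test time. A minimal sketch of one common aggregation scheme (self-consistency-style majority voting over the final answers of parallel rollouts; the function name and vote rule are illustrative assumptions, not the model's documented algorithm):

```python
from collections import Counter

def aggregate_rollouts(final_answers):
    """Majority-vote aggregation over the final answers of parallel rollouts.

    Each rollout is an independent reasoning pass over the same input;
    spending more test-time compute on additional rollouts makes the
    vote more robust, at proportionally higher inference cost.
    """
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Hypothetical final answers from five parallel visual explorations:
answer, agreement = aggregate_rollouts(["B", "B", "C", "B", "A"])
print(answer, agreement)  # B 0.6
```

This also makes the compute trade-off noted under Limitations concrete: N parallel rollouts cost roughly N times a single inference pass.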
Quick Start & Requirements
Inference is supported via the Hugging Face Transformers library. Recommended development environment: Python 3.10, torch >= 2.1.0, and transformers == 4.57.0. Currently, only bf16 inference is supported, and multi-patch image preprocessing is enabled by default.
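The version pins above can be captured in a minimal environment setup (a sketch; the environment name `step3vl` is an illustrative assumption):

```shell
# Isolated environment matching the recommended versions
python3.10 -m venv step3vl
. step3vl/bin/activate
pip install "torch>=2.1.0" "transformers==4.57.0"
```

Pinning transformers to the exact tested release avoids breakage from API changes between minor versions.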
Maintenance & Community
Community support is available via a WeChat group for technical discussions and updates.
Licensing & Compatibility
Licensed under Apache 2.0, which is generally permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
Inference currently supports bf16 data type exclusively. The PaCoRe inference mode requires significantly more computational resources at test time due to parallel rollouts.
Last updated: 1 month ago (repository currently flagged inactive).