Multimodal reasoning model for efficient vision-language tasks
Step3 is a 321B-parameter multimodal reasoning model designed for efficient vision-language tasks. Built on a Mixture-of-Experts architecture with co-designed attention mechanisms, it targets researchers and developers seeking high performance at reduced decoding cost.
How It Works
Step3 uses a Mixture-of-Experts (MoE) architecture with 48 experts, of which 3 are activated per token, yielding 38B active parameters out of 321B total. Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD) are co-designed with the model to reduce decoding cost and improve efficiency across a range of accelerators. This co-design approach aims for top-tier performance in vision-language reasoning.
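To make the routing pattern concrete, the following is a minimal sketch of top-3 routing over 48 experts, matching the counts stated above. The layer sizes, router, and expert definitions are illustrative assumptions, not Step3's actual implementation.

```python
# Toy top-k MoE routing: 48 experts, 3 activated per token (counts from the
# model description). Shapes and module names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS = 48   # total experts per MoE layer
TOP_K = 3          # experts activated per token

class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int = 256, d_ff: int = 512):
        super().__init__()
        self.router = nn.Linear(d_model, NUM_EXPERTS, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(NUM_EXPERTS)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is routed to its top-3 experts and
        # their outputs are combined with renormalized router weights.
        scores = F.softmax(self.router(x), dim=-1)             # (tokens, 48)
        weights, indices = scores.topk(TOP_K, dim=-1)          # (tokens, 3)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the selected 3
        out = torch.zeros_like(x)
        for k in range(TOP_K):
            for e in range(NUM_EXPERTS):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 256)
print(ToyMoELayer()(tokens).shape)  # torch.Size([16, 256])
```

Only the selected experts run per token, which is how the 321B-parameter model keeps its active parameter count at 38B.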
Quick Start & Requirements
Model checkpoints are available on Hugging Face in bf16 and block-fp8 formats. vLLM and SGLang are the recommended inference engines; deployment and request examples are provided in the Model Deployment Guide.
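As a hedged illustration, a locally served checkpoint could be queried through vLLM's OpenAI-compatible endpoint roughly as follows. The repository id, port, and prompt are assumptions; consult the Model Deployment Guide for the officially supported commands and engine versions.

```python
# Hypothetical request against a local vLLM server, e.g. started with:
#   vllm serve stepfun-ai/step3 --port 8000
# Repo id and URL are assumptions, not confirmed by the source.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="stepfun-ai/step3",  # assumed Hugging Face repo id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/figure.png"}},
            {"type": "text", "text": "Describe the key trend in this chart."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```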
Highlighted Details
Key points include the 321B-total / 38B-active MoE design, the MFA and AFD attention co-design for low decoding cost, checkpoints in bf16 and block-fp8, and Apache 2.0 licensing for both code and weights.
Maintenance & Community
Contact is available by email at contact@stepfun.com. The project is documented in a technical report and an accompanying blog post.
Licensing & Compatibility
Both code and model weights are released under the Apache License 2.0, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Documentation currently centers on the technical report introduction, suggesting the public release may still be at an early stage. Specific hardware requirements for optimal performance are not detailed beyond a stated compatibility with "low-end accelerators."