Vision-language model for multimodal reasoning and agent tasks
Kimi-VL is an open-source Mixture-of-Experts (MoE) vision-language model (VLM) designed for advanced multimodal reasoning, long-context understanding, and agent capabilities. It targets researchers and developers needing efficient yet powerful multimodal AI, offering state-of-the-art performance on complex tasks with a compact activated parameter footprint.
How It Works
Kimi-VL employs a Mixture-of-Experts (MoE) architecture for its language decoder, enabling efficient scaling of capabilities. It integrates a native-resolution visual encoder, MoonViT, for high-fidelity image understanding, and an MLP projector to bridge vision and language modalities. This design allows for selective activation of parameters, leading to faster inference and lower computational costs while maintaining strong performance across diverse tasks.
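To make that layout concrete, below is a minimal, illustrative PyTorch sketch of the three-part design (native-resolution vision encoder, MLP projector, MoE feed-forward in the decoder). All module names, dimensions, and the toy top-1 router are placeholders chosen for clarity, not Kimi-VL's actual implementation.

```python
# Illustrative sketch of the vision encoder -> MLP projector -> MoE decoder layout.
# Dimensions, module names, and the top-1 router are placeholders, not Kimi-VL's code.
import torch
import torch.nn as nn

class NativeResolutionViT(nn.Module):
    """Stand-in for MoonViT: patchifies an image at its native resolution."""
    def __init__(self, patch=14, dim=1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, image):                        # image: (B, 3, H, W)
        tokens = self.proj(image)                    # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)     # (B, N_vis, dim)

class MoEFeedForward(nn.Module):
    """Toy top-1 MoE layer: each token is routed to a single expert MLP."""
    def __init__(self, dim=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (B, T, dim)
        top1 = self.router(x).argmax(dim=-1)         # only one expert activated per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class ToyKimiVL(nn.Module):
    """Vision encoder -> MLP projector -> MoE language block."""
    def __init__(self, vis_dim=1024, lm_dim=2048):
        super().__init__()
        self.vision = NativeResolutionViT(dim=vis_dim)
        self.projector = nn.Sequential(nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))
        self.decoder_ffn = MoEFeedForward(dim=lm_dim)  # one block standing in for the decoder stack

    def forward(self, image, text_embeds):
        vis_tokens = self.projector(self.vision(image))   # map vision tokens into LM space
        seq = torch.cat([vis_tokens, text_embeds], dim=1) # prepend vision tokens to text embeddings
        return self.decoder_ffn(seq)

model = ToyKimiVL()
out = model(torch.randn(1, 3, 224, 224), torch.randn(1, 16, 2048))
print(out.shape)  # torch.Size([1, 272, 2048]): 256 vision tokens + 16 text tokens
```

The point of the sketch is the routing step: because only one expert runs per token, total parameters can grow with the number of experts while per-token compute stays roughly constant, which is what keeps the activated footprint compact.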
Quick Start & Requirements
Set up the environment with `conda create -n kimi-vl python=3.10 -y`, `conda activate kimi-vl`, and `pip install -r requirements.txt`. For optimized inference, install `flash-attn` with `pip install flash-attn --no-build-isolation`. When loading the model, `device_map="auto"` is recommended.
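As a sketch of what loading and inference might look like with Hugging Face transformers: the checkpoint name `moonshotai/Kimi-VL-A3B-Instruct`, the message format, and the flash-attention flag are assumptions here; the model card in the repository is authoritative.

```python
# Hedged sketch: load Kimi-VL via transformers with device_map="auto".
# Checkpoint name and message format are assumptions; consult the official model card.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"   # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",             # spread the MoE decoder across available devices
    trust_remote_code=True,
    # attn_implementation="flash_attention_2",  # optional, if flash-attn is installed
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("demo.png")
messages = [{"role": "user", "content": [
    {"type": "image", "image": "demo.png"},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```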
Fine-tuning supports single-GPU LoRA with 50 GB of VRAM or multi-GPU training with DeepSpeed.

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats