Kimi-VL by MoonshotAI

Vision-language model for multimodal reasoning and agent tasks

created 3 months ago
1,017 stars

Top 37.4% on sourcepulse

Project Summary

Kimi-VL is an open-source Mixture-of-Experts (MoE) vision-language model (VLM) designed for advanced multimodal reasoning, long-context understanding, and agent capabilities. It targets researchers and developers needing efficient yet powerful multimodal AI, offering state-of-the-art performance on complex tasks with a compact activated parameter footprint.

How It Works

Kimi-VL employs a Mixture-of-Experts (MoE) architecture for its language decoder, enabling efficient scaling of capabilities. It integrates a native-resolution visual encoder, MoonViT, for high-fidelity image understanding, and an MLP projector to bridge vision and language modalities. This design allows for selective activation of parameters, leading to faster inference and lower computational costs while maintaining strong performance across diverse tasks.
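To make the "selective activation" idea concrete, here is a minimal, generic top-k MoE routing sketch in PyTorch. It is not Kimi-VL's actual implementation; the layer sizes, expert count, and k are placeholder values chosen only to illustrate why the activated parameter footprint stays small even as total parameters grow.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer: each token is routed to only
    k experts, so only a fraction of the layer's parameters run per token.
    Sizes and k are placeholders, not Kimi-VL's real configuration."""
    def __init__(self, dim=1024, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x):  # x: (num_tokens, dim)
        scores = self.router(x).softmax(dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 1024)
print(TopKMoELayer()(tokens).shape)  # torch.Size([16, 1024])
```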

Quick Start & Requirements

  • Install: conda create -n kimi-vl python=3.10 -y, conda activate kimi-vl, pip install -r requirements.txt. For optimized inference, pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10, PyTorch 2.5.1, Transformers 4.51.3. Optional: flash-attn for performance.
  • Resources: Inference with device_map="auto" is recommended (a loading sketch follows this list). Fine-tuning supports single-GPU LoRA with 50GB VRAM or multi-GPU with DeepSpeed.
  • Links: HuggingFace Models, Chat Web Demo, vLLM Support, LLaMA-Factory Support.
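The sketch below shows one plausible way to load the model and run image-text inference with Hugging Face Transformers. The checkpoint name moonshotai/Kimi-VL-A3B-Instruct, the demo.png path, and the exact message format are assumptions; check the model card on HuggingFace for the authoritative usage.

```python
# Minimal inference sketch, assuming the moonshotai/Kimi-VL-A3B-Instruct checkpoint
# and the standard trust_remote_code loading path; adjust to the official model card.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("demo.png")  # hypothetical local image
messages = [{"role": "user", "content": [
    {"type": "image", "image": "demo.png"},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # strip the prompt
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```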

Highlighted Details

  • Achieves state-of-the-art results on agent interaction tasks (OSWorld).
  • Excels in college-level image/video comprehension, OCR, math reasoning, and multi-image understanding.
  • Supports a 128K context window for long documents and videos.
  • Native-resolution vision encoder (MoonViT) handles ultra-high-resolution inputs.
  • Offers "Thinking" variant fine-tuned with long CoT for advanced reasoning.

Maintenance & Community

  • Recent updates include support for vLLM deployment and LLaMA-Factory fine-tuning (a vLLM sketch follows this list).
  • Technical Report available on arXiv: 2504.07491.
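The following is a hedged offline-inference sketch using vLLM's Python API. The checkpoint name and max_model_len value are assumptions (128K context only fits if GPU memory allows); consult the repository's vLLM instructions for supported flags.

```python
# vLLM offline-inference sketch, assuming vLLM's Kimi-VL support and the
# moonshotai/Kimi-VL-A3B-Instruct checkpoint; exact options may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-VL-A3B-Instruct",  # assumed checkpoint name
    trust_remote_code=True,
    max_model_len=131072,  # 128K context window, memory permitting
)
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.chat(
    [{"role": "user", "content": "Summarize the Kimi-VL technical report in one sentence."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```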

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README. Users should verify licensing terms before commercial use.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial adoption.
  • While efficient, the 16B total parameter count still requires significant computational resources for full fine-tuning.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 8
  • Star History: 213 stars in the last 90 days
