Kimi-VL by MoonshotAI

Vision-language model for multimodal reasoning and agent tasks

created 3 months ago
1,017 stars

Top 37.4% on sourcepulse

Project Summary

Kimi-VL is an open-source Mixture-of-Experts (MoE) vision-language model (VLM) designed for advanced multimodal reasoning, long-context understanding, and agent capabilities. It targets researchers and developers needing efficient yet powerful multimodal AI, offering state-of-the-art performance on complex tasks with a compact activated parameter footprint.

How It Works

Kimi-VL employs a Mixture-of-Experts (MoE) architecture for its language decoder, enabling efficient scaling of capabilities. It integrates a native-resolution visual encoder, MoonViT, for high-fidelity image understanding, and an MLP projector to bridge vision and language modalities. This design allows for selective activation of parameters, leading to faster inference and lower computational costs while maintaining strong performance across diverse tasks.
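To make the "selective activation" idea concrete, here is a minimal, generic top-k MoE routing sketch in PyTorch. It is not Kimi-VL's actual implementation; the layer sizes, expert count, and k are placeholder values chosen only to illustrate why the activated parameter footprint stays small even as total parameters grow.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer: each token is routed to only
    k experts, so only a fraction of the layer's parameters run per token.
    Sizes and k are placeholders, not Kimi-VL's real configuration."""
    def __init__(self, dim=1024, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x):  # x: (num_tokens, dim)
        scores = self.router(x).softmax(dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 1024)
print(TopKMoELayer()(tokens).shape)  # torch.Size([16, 1024])
```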

Quick Start & Requirements

  • Install: conda create -n kimi-vl python=3.10 -y, conda activate kimi-vl, pip install -r requirements.txt. For optimized inference, pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10, PyTorch 2.5.1, Transformers 4.51.3. Optional: flash-attn for performance.
  • Resources: Inference with device_map="auto" is recommended (a loading sketch follows this list). Fine-tuning supports single-GPU LoRA with 50GB VRAM or multi-GPU with DeepSpeed.
  • Links: HuggingFace Models, Chat Web Demo, vLLM Support, LLaMA-Factory Support.
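The sketch below shows one plausible way to load the model and run image-text inference with Hugging Face Transformers. The checkpoint name moonshotai/Kimi-VL-A3B-Instruct, the demo.png path, and the exact message format are assumptions; check the model card on HuggingFace for the authoritative usage.

```python
# Minimal inference sketch, assuming the moonshotai/Kimi-VL-A3B-Instruct checkpoint
# and the standard trust_remote_code loading path; adjust to the official model card.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("demo.png")  # hypothetical local image
messages = [{"role": "user", "content": [
    {"type": "image", "image": "demo.png"},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # strip the prompt
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```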

Highlighted Details

  • Achieves state-of-the-art results on agent interaction tasks (OSWorld).
  • Excels in college-level image/video comprehension, OCR, math reasoning, and multi-image understanding.
  • Supports a 128K context window for long documents and videos.
  • Native-resolution vision encoder (MoonViT) handles ultra-high-resolution inputs.
  • Offers "Thinking" variant fine-tuned with long CoT for advanced reasoning.

Maintenance & Community

  • Recent updates include support for vLLM deployment and LLaMA-Factory fine-tuning (a vLLM sketch follows this list).
  • Technical Report available on arXiv: 2504.07491.
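The following is a hedged offline-inference sketch using vLLM's Python API. The checkpoint name and max_model_len value are assumptions (128K context only fits if GPU memory allows); consult the repository's vLLM instructions for supported flags.

```python
# vLLM offline-inference sketch, assuming vLLM's Kimi-VL support and the
# moonshotai/Kimi-VL-A3B-Instruct checkpoint; exact options may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-VL-A3B-Instruct",  # assumed checkpoint name
    trust_remote_code=True,
    max_model_len=131072,  # 128K context window, memory permitting
)
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.chat(
    [{"role": "user", "content": "Summarize the Kimi-VL technical report in one sentence."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```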

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README. Users should verify licensing terms before commercial use.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial adoption.
  • While efficient, the 16B total parameter count still requires significant computational resources for full fine-tuning.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 8
  • Star History: 213 stars in the last 90 days
