Kimi-K2.5 by MoonshotAI

Multimodal agentic AI for vision-grounded reasoning and task execution

Created 2 weeks ago


1,027 stars

Top 36.3% on SourcePulse

Project Summary

Kimi K2.5 is an open-source, native multimodal agentic model designed for complex tasks requiring integrated vision and language understanding. It targets developers and researchers seeking advanced capabilities in visual reasoning, agentic tool use, and coordinated task execution. Its key benefit is the ability to process visual inputs, generate code from visual specifications, and orchestrate dynamic agent swarms for self-directed problem-solving.

How It Works

Built upon Kimi-K2-Base, K2.5 is continually pre-trained on approximately 15 trillion mixed visual and text tokens, enabling native multimodality. Its architecture employs a Mixture-of-Experts (MoE) design with 1 trillion total parameters (32B activated) and a 256K token context length. A key innovation is the "Agent Swarm" capability, which decomposes complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents, moving beyond single-agent scaling. It also features "Coding with Vision," allowing code generation from visual inputs like UI designs.
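
As a rough, client-side illustration of the decompose-and-fan-out idea behind Agent Swarm (not Moonshot's actual server-side orchestration), the sketch below splits a task into sub-prompts and runs them in parallel against an OpenAI-compatible endpoint. The base URL, model identifier, and API key placeholder are assumptions; confirm them on the platform.

    # Illustrative sketch only: Agent Swarm is orchestrated by the model itself.
    # Here each "agent" is just an independent chat completion run in parallel.
    # Endpoint and model name are assumptions; verify at https://platform.moonshot.ai.
    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI(base_url="https://api.moonshot.ai/v1",  # assumed endpoint
                    api_key="YOUR_MOONSHOT_API_KEY")

    SUBTASKS = [
        "Research: list the key constraints of the problem.",
        "Design: propose an architecture that satisfies those constraints.",
        "Review: identify risks in the proposed architecture.",
    ]

    def run_agent(prompt: str) -> str:
        # One sub-task handled by one independent call.
        resp = client.chat.completions.create(
            model="kimi-k2.5",  # placeholder identifier
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    with ThreadPoolExecutor(max_workers=len(SUBTASKS)) as pool:
        results = list(pool.map(run_agent, SUBTASKS))

    for task, result in zip(SUBTASKS, results):
        print(f"--- {task}\n{result}\n")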

Quick Start & Requirements

Access Kimi K2.5 via its official API at https://platform.moonshot.ai, which offers OpenAI/Anthropic-compatible interfaces. Recommended inference engines include vLLM, SGLang, and KTransformers; transformers 4.57.1 or later is required. Specific hardware (e.g., GPU, VRAM) or OS requirements are not detailed, though they will be typical of large-scale model inference. Deployment examples and guides are available.
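
The call below is a minimal sketch of the OpenAI-compatible path using the openai Python SDK. The base URL and the "kimi-k2.5" model identifier are assumptions; check the platform documentation for the exact values.

    # Minimal sketch: calling Kimi K2.5 through the OpenAI-compatible API.
    # Base URL and model name are assumptions; confirm both on the platform.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.moonshot.ai/v1",  # assumed endpoint
        api_key="YOUR_MOONSHOT_API_KEY",
    )

    response = client.chat.completions.create(
        model="kimi-k2.5",  # placeholder; use the identifier listed on the platform
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize the trade-offs of MoE inference."},
        ],
        temperature=0.6,
    )
    print(response.choices[0].message.content)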

Highlighted Details

  • Demonstrates competitive performance across numerous benchmarks, including Reasoning & Knowledge, Image & Video understanding, and Agentic Search tasks, often outperforming or matching leading proprietary models.
  • Features native multimodality, processing both visual and text data seamlessly, with a 256K token context length.
  • Introduces an "Agent Swarm" paradigm for self-directed, coordinated execution of complex tasks by dynamically formed agent groups.
  • Supports "Coding with Vision," enabling code generation directly from visual specifications like UI mockups (a sketch of this workflow follows this list).
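
As a rough sketch of the vision-to-code workflow: assuming the OpenAI-style image_url content part is accepted by the compatible endpoint (an assumption based on the stated compatibility, as are the base URL and model name), a UI screenshot could be sent like this.

    # Hedged sketch: asking the model to generate code from a UI screenshot.
    # Support for the OpenAI-style "image_url" content part is an assumption
    # based on the stated OpenAI compatibility; verify in the platform docs.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="https://api.moonshot.ai/v1",  # assumed endpoint
                    api_key="YOUR_MOONSHOT_API_KEY")

    with open("mockup.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="kimi-k2.5",  # placeholder identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Generate HTML/CSS that reproduces this mockup."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)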

Maintenance & Community

The project provides a contact email (support@moonshot.cn) for inquiries. The README does not mention contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

Released under the Modified MIT License. This license generally permits commercial use and modification, but users should review its specific terms for any potential restrictions.

Limitations & Caveats

Chatting with video content is an experimental feature currently limited to the official API. Certain coding benchmarks (Terminal-Bench 2.0, SWE-Bench) were evaluated in non-thinking mode due to context management incompatibilities. Some benchmark evaluations for other models faced stability issues or were re-evaluated under specific conditions, potentially affecting direct comparisons.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 14
Star History: 1,049 stars in the last 19 days
