K2-Vendor-Verifier by MoonshotAI

Verifying Kimi K2 API vendor toolcall precision and reliability

Created 2 months ago
443 stars

Top 67.5% on SourcePulse

Project Summary

K2 Vendor Verifier (K2VV) addresses inconsistent tool-call accuracy across Kimi K2 API providers, an issue that affects both user experience and benchmark results. By publishing objective performance data, it helps users weigh accuracy against latency and cost. This benefits developers and researchers evaluating API providers for reliable Kimi K2 model integration.

How It Works

K2VV evaluates each K2 API provider by sending the same 4,000 tool-call requests and comparing the responses against the official Moonshot AI API. Key metrics include finish-reason counts, schema-validation errors, and successful tool calls. A "Similarity to Official API" score, derived from the Euclidean distance between a provider's metrics and the official API's, condenses accuracy and reliability into a single number that makes provider selection straightforward.
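The similarity score could be sketched as follows. This is a hypothetical reconstruction, not K2VV's actual formula: it assumes each provider is summarized as a vector of metric rates in [0, 1] and that the Euclidean distance is normalized by its maximum possible value.

```python
import math

def similarity_to_official(provider: list[float], official: list[float]) -> float:
    """Hypothetical sketch of a 'Similarity to Official API' score:
    1 minus the Euclidean distance between two metric-rate vectors,
    normalized by the maximum possible distance sqrt(n) so the result
    lands in [0, 1]. The exact normalization K2VV uses may differ."""
    if len(provider) != len(official):
        raise ValueError("metric vectors must have the same length")
    dist = math.sqrt(sum((p - o) ** 2 for p, o in zip(provider, official)))
    return 1.0 - dist / math.sqrt(len(official))

# Identical metric vectors give a similarity of 1.0 (100%);
# maximally different rate vectors give 0.0.
print(similarity_to_official([0.95, 0.02, 0.93], [0.95, 0.02, 0.93]))  # 1.0
```

Under this reading, a provider whose finish-reason distribution and tool-call success rates closely track the official API scores near 100%, which matches the ranking pattern in the results below.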

Quick Start & Requirements


To run an evaluation:

```shell
python tool_calls_eval.py samples.jsonl \
  --model kimi-k2-0905-preview \
  --base-url https://api.moonshot.cn/v1 \
  --api-key YOUR_API_KEY \
  --concurrency 5 \
  --output results.jsonl \
  --summary summary.json
```

Requires Python, the samples.jsonl dataset, and an API key for the provider under test. Testing through OpenRouter is supported via --extra-body. Resource footprint is moderate: a single script plus the cost and time of the API calls.
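A hypothetical OpenRouter invocation might look like the following. The model slug, the JSON shape (which follows OpenRouter's documented provider-routing options), and the assumption that --extra-body accepts a JSON string merged into each request body are all illustrative, not confirmed by the README.

```shell
# Assumptions: --extra-body takes a JSON string merged into the request
# body, and the provider-routing payload follows OpenRouter's options
# (pin one provider, disable fallbacks). Slugs shown are illustrative.
python tool_calls_eval.py samples.jsonl \
  --model moonshotai/kimi-k2-0905 \
  --base-url https://openrouter.ai/api/v1 \
  --api-key YOUR_OPENROUTER_KEY \
  --concurrency 5 \
  --output results.jsonl \
  --summary summary.json \
  --extra-body '{"provider": {"order": ["groq"], "allow_fallbacks": false}}'
```

Pinning a single provider this way is what makes per-vendor comparisons meaningful, since OpenRouter otherwise routes requests across vendors.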

Highlighted Details


  • The official Moonshot AI Turbo implementation achieved 99.26% similarity.

  • Other providers show varied similarity: NovitaAI (92.87%), Groq (89.39%), Fireworks (88.83%).

  • Providers like Nebius (70.49%), Chutes (49.12%), and AtlasCloud (48.93%) exhibit significantly lower similarity.

  • Evaluation metrics cover finish reasons, schema validation errors, and successful tool calls over a 4,000-request test set.
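The three metric families above can be illustrated with a small per-response classifier. Field names follow the OpenAI-style chat-completion schema that Kimi K2 providers expose; the actual bookkeeping in tool_calls_eval.py is an assumption.

```python
import json

def classify_response(resp: dict) -> tuple[str, int, int]:
    """Illustrative sketch (not the real script's logic): record the
    finish reason, then count tool calls whose `arguments` field parses
    as valid JSON (successful) versus those that fail (schema errors).
    Returns (finish_reason, ok_calls, bad_calls)."""
    finish = resp.get("finish_reason", "unknown")
    if finish != "tool_calls":
        return finish, 0, 0
    ok = bad = 0
    for call in resp.get("tool_calls", []):
        try:
            json.loads(call["function"]["arguments"])
            ok += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            bad += 1
    return finish, ok, bad
```

Aggregating these tuples over the 4,000-request test set yields the finish-reason counts, schema-error counts, and successful-tool-call counts that feed the similarity score.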

Maintenance & Community


Contact: shijuanfeng@moonshot.cn for inquiries. The project is preparing for new benchmarks and welcomes community input on metrics, test cases, or vendors via GitHub issues.

Licensing & Compatibility


No license information is provided in the README.

Limitations & Caveats

Evaluations are specific to the kimi-k2-0905-preview model and the samples.jsonl dataset; results may differ with other model versions or data. The vendor list is not exhaustive, and new providers can be requested. "Similarity to Official API" is a custom aggregate metric rather than a standard benchmark.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 8
  • Star history: 130 stars in the last 30 days

