K2-Vendor-Verifier by MoonshotAI

Verifying Kimi K2 API vendor toolcall precision and reliability

Created 2 months ago
443 stars

Top 67.5% on SourcePulse

Project Summary

K2 Vendor Verifier (K2VV) addresses inconsistent tool-call accuracy across Kimi K2 API providers, an issue that affects both user experience and benchmark results. By publishing objective performance data, it helps users weigh accuracy against latency and cost. This benefits developers and researchers evaluating API providers for reliable Kimi K2 model integration.

How It Works

K2VV evaluates each K2 API provider by sending the same 4,000 tool-call requests and comparing the responses against the official Moonshot AI API. Key metrics include finish-reason counts, schema-validation errors, and successful tool calls. A "Similarity to Official API" score, derived from the Euclidean distance between a provider's metrics and the official API's, condenses accuracy and reliability into a single number that makes provider selection straightforward.
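The similarity score could be sketched as follows. This is a hypothetical reconstruction, not K2VV's actual formula: it assumes each provider is summarized as a vector of metric rates in [0, 1] and that the Euclidean distance is normalized by its maximum possible value.

```python
import math

def similarity_to_official(provider: list[float], official: list[float]) -> float:
    """Hypothetical sketch of a 'Similarity to Official API' score:
    1 minus the Euclidean distance between two metric-rate vectors,
    normalized by the maximum possible distance sqrt(n) so the result
    lands in [0, 1]. The exact normalization K2VV uses may differ."""
    if len(provider) != len(official):
        raise ValueError("metric vectors must have the same length")
    dist = math.sqrt(sum((p - o) ** 2 for p, o in zip(provider, official)))
    return 1.0 - dist / math.sqrt(len(official))

# Identical metric vectors give a similarity of 1.0 (100%);
# maximally different rate vectors give 0.0.
print(similarity_to_official([0.95, 0.02, 0.93], [0.95, 0.02, 0.93]))  # 1.0
```

Under this reading, a provider whose finish-reason distribution and tool-call success rates closely track the official API scores near 100%, which matches the ranking pattern in the results below.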

Quick Start & Requirements


To run an evaluation:

```shell
python tool_calls_eval.py samples.jsonl \
  --model kimi-k2-0905-preview \
  --base-url https://api.moonshot.cn/v1 \
  --api-key YOUR_API_KEY \
  --concurrency 5 \
  --output results.jsonl \
  --summary summary.json
```

Requires Python, the samples.jsonl dataset, and an API key for the provider under test. Testing through OpenRouter is supported via --extra-body. Resource footprint is moderate: a single script plus the cost and time of the API calls.
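A hypothetical OpenRouter invocation might look like the following. The model slug, the JSON shape (which follows OpenRouter's documented provider-routing options), and the assumption that --extra-body accepts a JSON string merged into each request body are all illustrative, not confirmed by the README.

```shell
# Assumptions: --extra-body takes a JSON string merged into the request
# body, and the provider-routing payload follows OpenRouter's options
# (pin one provider, disable fallbacks). Slugs shown are illustrative.
python tool_calls_eval.py samples.jsonl \
  --model moonshotai/kimi-k2-0905 \
  --base-url https://openrouter.ai/api/v1 \
  --api-key YOUR_OPENROUTER_KEY \
  --concurrency 5 \
  --output results.jsonl \
  --summary summary.json \
  --extra-body '{"provider": {"order": ["groq"], "allow_fallbacks": false}}'
```

Pinning a single provider this way is what makes per-vendor comparisons meaningful, since OpenRouter otherwise routes requests across vendors.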

Highlighted Details


  • The official Moonshot AI Turbo implementation achieved 99.26% similarity.

  • Other providers show varied similarity: NovitaAI (92.87%), Groq (89.39%), Fireworks (88.83%).

  • Providers like Nebius (70.49%), Chutes (49.12%), and AtlasCloud (48.93%) exhibit significantly lower similarity.

  • Evaluation metrics cover finish reasons, schema validation errors, and successful tool calls over a 4,000-request test set.
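The three metric families above can be illustrated with a small per-response classifier. Field names follow the OpenAI-style chat-completion schema that Kimi K2 providers expose; the actual bookkeeping in tool_calls_eval.py is an assumption.

```python
import json

def classify_response(resp: dict) -> tuple[str, int, int]:
    """Illustrative sketch (not the real script's logic): record the
    finish reason, then count tool calls whose `arguments` field parses
    as valid JSON (successful) versus those that fail (schema errors).
    Returns (finish_reason, ok_calls, bad_calls)."""
    finish = resp.get("finish_reason", "unknown")
    if finish != "tool_calls":
        return finish, 0, 0
    ok = bad = 0
    for call in resp.get("tool_calls", []):
        try:
            json.loads(call["function"]["arguments"])
            ok += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            bad += 1
    return finish, ok, bad
```

Aggregating these tuples over the 4,000-request test set yields the finish-reason counts, schema-error counts, and successful-tool-call counts that feed the similarity score.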

Maintenance & Community


Contact: shijuanfeng@moonshot.cn for inquiries. The project is preparing for new benchmarks and welcomes community input on metrics, test cases, or vendors via GitHub issues.

Licensing & Compatibility


No license information is provided in the README.

Limitations & Caveats

Evaluations are specific to the kimi-k2-0905-preview model and the samples.jsonl dataset; results may differ with other model versions or data. The vendor list is not exhaustive, and new providers can be requested. "Similarity to Official API" is a custom aggregate metric rather than a standard benchmark.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 8
  • Star history: 130 stars in the last 30 days

