MoonshotAI: Verifying Kimi K2 API vendor tool call precision and reliability
K2 Vendor Verifier (K2VV) addresses inconsistent tool call precision across Kimi K2 API providers, an issue that affects both user experience and benchmark results. By publishing objective performance data, it lets users weigh accuracy alongside latency and cost. This benefits developers and researchers choosing an API provider for reliable Kimi K2 model integration.
How It Works
K2VV systematically evaluates K2 API providers by sending 4,000 tool call requests to each and comparing the responses against the official Moonshot AI API. Key metrics include finish reason counts, schema validation errors, and successful tool calls. A "Similarity to Official API" score, derived from the Euclidean distance between a provider's metrics and the official API's, quantifies accuracy and reliability and enables informed provider selection.
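A minimal sketch of how such a score could be computed, assuming each provider is reduced to a vector of normalized metric rates; the metric names, example values, and 0-100% scaling below are illustrative assumptions, not K2VV's documented formula:

import math

# Illustrative metric vectors (rates in [0, 1]); the real K2VV metrics and values differ.
OFFICIAL = {"finish_stop": 0.55, "finish_tool_calls": 0.44, "schema_error": 0.00, "successful_tool_call": 0.44}
VENDOR = {"finish_stop": 0.60, "finish_tool_calls": 0.38, "schema_error": 0.03, "successful_tool_call": 0.36}

def similarity_to_official(vendor, official):
    # Euclidean distance between the two metric vectors, mapped to a 0-100% score.
    keys = sorted(official)
    dist = math.sqrt(sum((vendor[k] - official[k]) ** 2 for k in keys))
    max_dist = math.sqrt(len(keys))  # upper bound when every rate differs by 1.0
    return 100 * (1 - dist / max_dist)

print(f"Similarity to Official API: {similarity_to_official(VENDOR, OFFICIAL):.2f}%")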
Quick Start & Requirements
To run evaluations:
python tool_calls_eval.py samples.jsonl --model kimi-k2-0905-preview --base-url https://api.moonshot.cn/v1 --api-key YOUR_API_KEY --concurrency 5 --output results.jsonl --summary summary.json
Requires Python, the samples.jsonl dataset, and an API key for the provider under test. Individual vendors can also be tested through OpenRouter via --extra-body (see the example below). Resource requirements are moderate, covering script execution and the API calls.
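For example, to evaluate a specific vendor through OpenRouter, the extra request body can restrict provider routing. A sketch of such an invocation, where the model slug and vendor name are placeholders to adapt and the routing JSON assumes OpenRouter's provider preferences format:

python tool_calls_eval.py samples.jsonl --model moonshotai/kimi-k2-0905 --base-url https://openrouter.ai/api/v1 --api-key YOUR_OPENROUTER_API_KEY --concurrency 5 --output results.jsonl --summary summary.json --extra-body '{"provider": {"only": ["<vendor_name>"]}}'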
Highlighted Details
The official Moonshot AI Turbo implementation achieved 99.26% similarity.
Other providers show varied similarity: NovitaAI (92.87%), Groq (89.39%), Fireworks (88.83%).
Providers like Nebius (70.49%), Chutes (49.12%), and AtlasCloud (48.93%) exhibit significantly lower similarity.
Evaluation metrics cover finish reasons, schema validation errors, and successful tool calls over a 4,000-request test set.
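As a rough sketch of how these counters could be aggregated from the per-request output, assuming results.jsonl holds one JSON object per request with finish_reason and schema_error fields (the actual field names in K2VV's output may differ):

import json
from collections import Counter

def summarize(results_path="results.jsonl"):
    # Tally finish reasons, schema validation errors, and successful tool calls
    # from the per-request results file. Field names here are assumptions and
    # may not match K2VV's actual output schema.
    finish_reasons = Counter()
    schema_errors = 0
    successful_tool_calls = 0
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            reason = record.get("finish_reason", "unknown")
            finish_reasons[reason] += 1
            if record.get("schema_error"):
                schema_errors += 1
            elif reason == "tool_calls":
                successful_tool_calls += 1
    return {
        "finish_reasons": dict(finish_reasons),
        "schema_validation_errors": schema_errors,
        "successful_tool_calls": successful_tool_calls,
    }

print(json.dumps(summarize(), indent=2))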
Maintenance & Community
Contact shijuanfeng@moonshot.cn for inquiries. The project is preparing new benchmarks and welcomes community input on metrics, test cases, or vendors via GitHub issues.
Licensing & Compatibility
No license information is provided in the README.
Limitations & Caveats
Evaluations are specific to the kimi-k2-0905-preview model and the samples.jsonl dataset; results may vary with other model versions or data. The vendor list is not exhaustive, and new providers can be requested. The "Similarity to Official API" score is a custom metric based on aggregated provider metrics.