Showing 1 - 25 of 135 repos
# | Repository | Description | Stars | Stars 7d Δ | Stars 7d % | PRs 7d Δ | Created | Response rate | Last active
---|---|---|---|---|---|---|---|---|---
1 | TEN-framework/ten-framework | Real-time, distributed, cloud-edge collaborative multimodal AI agent framework. Supports C++/Go/Python/JS/TS. Integrates LLMs, vision, and h... | 7k (Top 10%) | 130 | 1.9% | 43 | 1y ago | Inactive | 6h ago
2 | modelscope/evalscope | Framework for model evaluation and benchmarking. Supports LLMs, multimodal, embedding, and CLIP models. Includes built-in benchmarks and met... | 1k (Top 50%) | 34 | 2.5% | 2 | 1y ago | 1 day | 1d ago
3 | | Framework for holistic evaluation of foundation models (LLMs, multimodal). Includes benchmarks, models, and metrics (efficiency, bias, toxicity)... | 2k (Top 25%) | 22 | 0.9% | 13 | 3y ago | 1 week | 16h ago
4 | | Library to evaluate and compare ML models. Includes implementations of metrics for NLP, CV, and other tasks. Add new evaluation modules easi... | 2k (Top 25%) | 8 | 0.3% | 0 | 3y ago | Inactive | 3w ago
5 | heilcheng/2025-GSoC-Proposal-Selected | Benchmark suite to evaluate Google's Gemma language models across tasks. Compares Gemma to Llama 2 and Mistral. Includes visualizations. | 442 | 2 | 0.5% | 0 | 3mo ago | Inactive | 2mo ago
6 | | Framework for Claude Code. Enhances it with specialized commands, cognitive personas, and development methodologies. | 12k (Top 5%) | 1,059 | 9.7% | 8 | 1mo ago | Inactive | 1d ago
7 | | Framework for building LLM-as-a-judge systems. Composes judge architectural primitives for scalable oversight and automated evaluation. | 264 | 6 | 2.3% | 0 | 7mo ago | 1 day | 2w ago
8 | | Library for evaluating LLM applications. Supports LLM-as-judge, RAG, code, and other evaluators. Integrates with LangSmith for experim... | 651 | 9 | 1.4% | 0 | 5mo ago | 1 week | 1mo ago
9 | prometheus-eval/prometheus-eval | Framework for evaluating LLMs on generation tasks. Includes tools for training, evaluating, and using language models as judges. | 977 (Top 50%) | 4 | 0.4% | 0 | 1y ago | 1 week | 3mo ago
10 | | Collection of examples using Zeno to evaluate generative AI models. Architecture-agnostic: works with OpenAI, LangChain, and Hugging... | 490 | 1 | 0.2% | 0 | 2y ago | 1 day | 1y ago
11 | | Framework to manage and evaluate LLM-driven experiences. Standardizes API calls, evaluates outputs, and automates prompt engineering. Supports m... | 457 | 1 | 0.2% | 0 | 2y ago | 1 week | 6mo ago
12 | | Evaluation framework for code LLMs. Features HumanEval+ and MBPP+ for correctness, and EvalPerf for efficiency. Supports many LLM backends. | 2k (Top 50%) | 7 | 0.5% | 0 | 2y ago | 1 day | 3w ago
13 | locuslab/open-unlearning | Framework for evaluating LLM unlearning. Supports benchmarks such as TOFU and MUSE, various unlearning methods, datasets, and evaluation metrics. | 334 | 3 | 0.9% | 1 | 1y ago | 1 day | 1w ago
14 | hkust-nlp/ceval | Chinese evaluation suite for foundation models. Contains 13,948 multiple-choice questions across 52 disciplines. Evaluates zero/few-shot perfo... | 2k (Top 25%) | 2 | 0.1% | 0 | 2y ago | 1 week | 6d ago
15 | | A PyTorch-based library for evaluating LLMs. Supports prompt engineering, adversarial prompts, dynamic evaluation, and multi-prompt evalu... | 3k (Top 25%) | 6 | 0.2% | 0 | 2y ago | Inactive | 3w ago
16 | | Defines desired model behaviors via a specification. Includes evaluation prompts to test model performance in challenging situations. | 524 | 36 | 7.1% | 0 | 5mo ago | Inactive | 3mo ago
17 | | Framework to evaluate LLMs and systems built with LLMs. Includes a registry of evals and the ability to write custom evals. | 17k (Top 5%) | 39 | 0.2% | 0 | 2y ago | Inactive | 7mo ago
18 | | Open-source framework for evaluating, testing, and monitoring LLM applications. Includes tracing, a prompt playground, and LLM-as-a-judge metr... | 12k (Top 5%) | 300 | 2.5% | 34 | 2y ago | 1 day | 16h ago
19 | THUDM/AlignBench | Benchmark for evaluating the alignment of Chinese LLMs. Uses multi-dimensional, rule-calibrated LLM-as-judge evaluation with chain-of-thought an... | 401 | 2 | 0.5% | 0 | 1y ago | 1 day | 11mo ago
20 | allenai/OLMo-Eval | Framework to evaluate language models on NLP tasks. Supports task sets, aggregate metrics, and integration with Google Sheets. | 355 | 0 | 0% | 0 | 1y ago | Inactive | 3w ago
21 | centerforaisafety/HarmBench | Framework to evaluate automated red-teaming methods and LLM attacks/defenses. Supports 33 LLMs and 18 red-teaming methods. | 693 | 9 | 1.3% | 0 | 1y ago | 1 day | 11mo ago
22 | | Open-source visual environment for prompt engineering and LLM evaluation. Compare prompts/models, set up metrics, and visualize results. | 3k (Top 25%) | 8 | 0.3% | 0 | 2y ago | Inactive | 3d ago
23 | | Framework for multi-agent AI apps that can act autonomously or alongside humans. Supports agent workflows and rapid prototyping. | 48k (Top 1%) | 312 | 0.6% | 21 | 1y ago | 1 day | 16h ago
24 | WeOpenML/PandaLM | Automated LLM evaluation benchmark. Compares responses, gives reasons, and includes a human-annotated dataset for validation and training... | 921 (Top 50%) | 1 | 0.1% | 0 | 2y ago | 1 day | 1y ago
25 | | Unified toolkit for evaluating post-trained language models. Supports parallel evaluation, simplified usage, and standardized result managem... | 492 | 7 | 1.4% | 0 | 8mo ago | 1 week | 1mo ago